Interleaved Evaluation for Retrospective Summarization
and Prospective Notification on Document Streams
Xin Qian, Jimmy Lin, Adam Roegiest
David R. Cheriton School of Computer Science
University of Waterloo
Monday, July 18, 2016
Motivating example: volunteering at the Rio 2016 Summer Olympics
Source: http://www.phdcomics.com/comics/archive_print.php?comicid=1414
Retrospective summarization (looking back from now): Tweet Timeline Generation, TREC 2014
Prospective notification (looking ahead to the future): Push Notification Scenario, TREC 2015
More than microblogs…
• RSS feeds, social media posts, medical record updates
Evaluation Methodology: A/B Testing
• Controlled experiments via A/B tests
• Random noise between buckets
• Low question bandwidth
Evaluation Methodology: A/B Testing vs. Interleaving
• Interleaving uses a within-subject design
• Better sensitivity
• Faster experiment progress
Interleaved Evaluation Methodology
• How exactly do we interleave the output of two systems into one single output?
  • Balanced interleaving [1]
  • Team-Draft interleaving [2] (see the sketch after this list)
  • Probabilistic interleaving [3]
• How do we assign credit to each system in response to user interactions with the interleaved results?
  • Aggregate over user clicks [1, 2, 3]
  • More sophisticated click aggregation strategies [4, 5, 6]

Source: [1] T. Joachims. Optimizing search engines using clickthrough data. KDD, 2002. [2] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? CIKM, 2008. [3] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. CIKM, 2011. [4] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. TOIS, 2012. [5] F. Radlinski and N. Craswell. Comparing the sensitivity of information retrieval metrics. SIGIR, 2010. [6] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning more powerful test statistics for click-based retrieval evaluation. SIGIR, 2010.
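For context, a minimal sketch of classic Team-Draft interleaving [2] for ranked lists, assuming each run is a plain list of document IDs (the function and variable names here are illustrative, not from the talk):

```python
import random

def team_draft_interleave(run_a, run_b, k=10):
    """Team-Draft interleaving: in each round a coin flip decides
    which system drafts first; each system then contributes its
    highest-ranked result not already selected."""
    interleaved, picks = [], {"A": set(), "B": set()}
    seen = set()
    while len(interleaved) < k:
        order = [("A", run_a), ("B", run_b)]
        random.shuffle(order)  # the coin flip for this round
        progressed = False
        for name, run in order:
            candidate = next((d for d in run if d not in seen), None)
            if candidate is not None and len(interleaved) < k:
                interleaved.append(candidate)
                seen.add(candidate)
                picks[name].add(candidate)
                progressed = True
        if not progressed:  # both runs exhausted
            break
    return interleaved, picks  # picks attribute later clicks to a system
```

Balanced and probabilistic interleaving differ in how the next contribution is chosen, but they expose the same basic interface: two ranked lists in, one interleaved list plus credit attributions out.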
Temporality
Source: https://unsplash.com/photos/ft0-Xu4nTvA
Verbosity
Source: https://unsplash.com/photos/QlnUpMED6Qs
Redundancy
Source: https://unsplash.com/photos/jqlXLIj3aS8
Existing Approaches
For Balanced, Team-Draft, and Probabilistic interleaving alike:
• Temporality: not satisfied
• Verbosity: could be satisfied, but largely biased
• Redundancy: satisfied
Temporal Interleaving
• Union and sort by document timestamp T
• For prospective notifications, T = tweet push time (e.g., Neu-IR 2016)
(a code sketch follows the de-duplication example below)
[Figure: System A and System B result streams, shown as tweet IDs, merged into a single timeline; tweet 743682473083990521 appears in both systems' output.]
De-duplication
[Figure: the same interleaved stream after collapsing the duplicate tweet 743682473083990521 into a single entry.]
De-duplication matters!
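A minimal sketch of the temporal interleaving strategy with de-duplication, assuming each system's output is a list of (tweet_id, timestamp) pairs (all names here are illustrative):

```python
def temporal_interleave(results_a, results_b):
    """Union the two systems' results, collapse duplicates, and sort
    by document timestamp T (for prospective notifications, T is the
    tweet push time)."""
    merged = {}
    for system, results in (("A", results_a), ("B", results_b)):
        for tweet_id, timestamp in results:
            # A tweet pushed by both systems becomes a single entry
            # that remembers both sources.
            entry = merged.setdefault(tweet_id, {"t": timestamp, "systems": set()})
            entry["systems"].add(system)
    return sorted(
        ((tid, e["t"], e["systems"]) for tid, e in merged.items()),
        key=lambda item: item[1],
    )
```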
User Interaction: Explicit Judgments
• Not relevant
• Relevant
• Relevant but redundant
[Figure: credit assignment example. Each tweet the user judges relevant earns +1 for the system that pushed it; not-relevant tweets earn nothing.]
“Masking Effect”
[Figure: one system's relevant tweet earns +1 while the other system's redundant copy earns no credit, even though it would have earned credit on its own.]
• Solution: discount factor = the fraction of relevant or redundant tweets above that come from the other system
• Alternative: ask the user to identify the source of the redundancy
“Masking Effect”
[Figure: worked example of the discount factor. The masked system's redundant tweet receives +0.66 credit (two of the three relevant or redundant tweets above it came from the other system) instead of a full +1.]
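A sketch of this heuristic credit assignment, assuming the interleaved stream is a time-ordered list of (system, judgment) tuples; the names are illustrative, and the example reproduces the +0.66 case above:

```python
def assign_credit(interleaved):
    """+1 for each relevant tweet; for a relevant-but-redundant tweet,
    credit equal to the fraction of relevant or redundant tweets above
    it that came from the other system (the discount factor), which
    compensates for the masking effect."""
    credit = {"A": 0.0, "B": 0.0}
    above = []  # source systems of relevant/redundant tweets seen so far
    for system, judgment in interleaved:
        if judgment == "relevant":
            credit[system] += 1.0
        elif judgment == "redundant" and above:
            credit[system] += sum(s != system for s in above) / len(above)
        if judgment in ("relevant", "redundant"):
            above.append(system)
    return credit

# Two relevant tweets from A, one from B, then a redundant tweet from B:
# B's redundant tweet earns 2/3, i.e. +0.66.
print(assign_credit([("A", "relevant"), ("A", "relevant"),
                     ("B", "relevant"), ("B", "redundant")]))
```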
Simulation-based Meta-evaluation: Dataset
• Grounded in the Microblog track at TREC
• TREC 2015 real-time filtering task, push notification scenario
• 14 groups submitted 37 runs
• Semantic cluster annotations (grouping redundant tweets)
• Normalized cumulative gain (nCG): nCG = (1/Z) × Σ gain(t) over pushed tweets t, where Z is the maximum possible gain (a sketch follows the example below)
Example semantic cluster (redundant tweets about the same event):
• 6:28pm At least seven killed in shooting at Sikh temple in Wisconsin
• 6:10pm >= 7 killed in shooting @ Sikh temple in Wisconsin
• 6:10pm 4 were shot inside the Sikh Temple of Wisconsin and 3 outside, including a gunman killed by police
• Four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police around 6:10pm
• 6:28pm seven Wisconsin shooting at Sikh temple
• 6:10pm four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police
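A minimal sketch of nCG under cluster-based gain, assuming unit gain for the first tweet pushed from each semantic cluster (the track's actual gain values may differ; all names are illustrative):

```python
def ncg(pushed, cluster_of, z_max):
    """Normalized cumulative gain: only the first tweet pushed from
    each semantic cluster earns gain; z_max is Z, the maximum
    possible gain for the topic."""
    seen, gain = set(), 0.0
    for tweet_id in pushed:
        cluster = cluster_of.get(tweet_id)  # None -> not relevant
        if cluster is not None and cluster not in seen:
            seen.add(cluster)
            gain += 1.0  # unit gain per novel relevant tweet (assumption)
    return gain / z_max if z_max > 0 else 0.0
```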
Simulation-based Meta-evaluation: Experiment Setup
• Apply the temporal interleaving strategy
• Assume a simulated user who judges according to the semantic cluster annotations
• Compare the difference in assigned credit against the difference in nCG
Agreement = the fraction of system pairs for which the sign of the credit difference matches the sign of the nCG difference (correct vs. incorrect pairs).
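A minimal sketch of this agreement computation, assuming one (credit difference, nCG difference) tuple per simulated system pair; the tie handling is an assumption:

```python
def agreement(pairs):
    """`pairs` holds (credit_a - credit_b, ncg_a - ncg_b) per system
    pair; agreement is the fraction where interleaved credit and nCG
    prefer the same system."""
    correct = sum(1 for d_credit, d_ncg in pairs
                  if (d_credit > 0) == (d_ncg > 0))
    return correct / len(pairs)
```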
Conclusions
• Effectiveness: 92%–93% agreement
• Highest effectiveness: all pairs, binary relevance on a “simple task” *
• Effect of using different assessors *
• Cluster importance and cluster weight *
• Effect of “quiet days” and their treatment *
• Recall-oriented credit assignment *
• Precision/recall tradeoff *
Criticism
• Interleaved evaluation (nearly) doubles the assessment work!
Assessor Effort: Output Length
• Solution: flip a biased coin and keep each result with probability p
• No extra work for the assessor; accuracy still reasonable
• Implicit judgments instead? In web search: click models [1, 2], eye-tracking studies [3, 4], etc. [5, 6]
• Push notifications differ: multimedia elements, different types of clicks…

Source: [1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. SIGIR, 2006. [2] O. Chapelle and Y. Zhang. A Dynamic Bayesian Network click model for web search ranking. WWW, 2009. [3] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in WWW search. SIGIR, 2004. [4] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS, 25(2):1–27, 2007. [5] D. Kelly. Understanding implicit feedback and document preference: A naturalistic user study. SIGIR Forum, 38(1):77, 2004. [6] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18–28, 2003.
Assessor Effort: Explicit Judgments
• Explicit judgments: dismiss/click on notifications
• Solution: the user pays attention to each notification with probability r
• Good prediction accuracy even with limited user interactions
• No extra work; still reasonable
Assessor Effort: Combining Both
• Randomly discarding system output + limited user interactions (a combined sketch follows below)
• Accuracy vs. verbosity tradeoff curves
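A sketch combining the two effort reductions, assuming a `judge` function that returns the user's judgment for a tweet (all names here are illustrative):

```python
import random

def simulate_assessment(interleaved, judge, p=0.5, r=0.5):
    """Each interleaved result is kept with probability p (shorter
    output) and, if kept, judged with probability r (limited explicit
    judgments); the observed subset feeds credit assignment."""
    observed = []
    for system, tweet in interleaved:
        if random.random() >= p:
            continue  # discarded: never shown to the user
        if random.random() < r:
            observed.append((system, judge(tweet)))
    return observed
```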
Summary
• A novel interleaved evaluation methodology
• A temporal interleaving strategy
• A heuristic credit assignment method
• A user interaction model with explicit judgments
• A simulation-based meta-evaluation
• Analysis of assessor effort: output length, explicit judgments
Human-in-the-loop assessment
[Figure: assessment pipeline. Participant systems and a baseline system (YoGosling) monitor the Twitter Streaming API; their output is sent to the TREC RTS server, interleaved, and pushed to mobile app assessors for evaluation.]
Questions?