Interleaved Evaluation for Retrospective Summarization and Prospective Notification on Document Streams Xin Qian, Jimmy Lin, Adam Roegiest David R. Cheriton School of Computer Science University of Waterloo Monday, July 18, 2016

Interleaving - SIGIR 2016 presentation


Page 1

Interleaved Evaluation for Retrospective Summarization

and Prospective Notification on Document Streams

Xin Qian, Jimmy Lin, Adam Roegiest

David R. Cheriton School of Computer Science

University of Waterloo

Monday, July 18, 2016

Page 2

Volunteering Rio 2016 Summer Olympics

Source: http://www.phdcomics.com/comics/archive_print.php?comicid=1414

Retrospective summarization: Tweet Timeline Generation, TREC 2014

Prospective notification: Push Notification Scenario, TREC 2015

(Timeline: Now → Future)

More than microblogs: RSS feeds, social media posts, medical record updates

Page 3

Evaluation Methodology: A/B test

• Controlled experiments in an A/B test
• Random noise between buckets
• Low question bandwidth

Page 4

Evaluation Methodology: A/B test vs. interleaving

• Within-subject design in interleaving
• Better sensitivity
• Prompt experiment progress

Page 5

Interleaved Evaluation Methodology

• How exactly do we interleave the output of two systems into one single output?
  • Balanced interleaving [1]
  • Team-Draft interleaving [2]
  • Probabilistic interleaving [3]
• How do we assign credit to each system in response to user interactions with the interleaved results?
  • Aggregate over user clicks [1, 2, 3]
  • More sophisticated click aggregation strategies [4, 5, 6]

Sources: [1] T. Joachims. Optimizing search engines using clickthrough data. KDD, 2002. [2] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? CIKM, 2008. [3] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. CIKM, 2011. [4] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. TOIS, 2012. [5] F. Radlinski and N. Craswell. Comparing the sensitivity of information retrieval metrics. SIGIR, 2010. [6] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning more powerful test statistics for click-based retrieval evaluation. SIGIR, 2010.
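To make the first question concrete, here is a minimal sketch of Team-Draft interleaving [2] in Python. All function and variable names are mine, not from the cited papers; this is an illustrative sketch, not the authors' implementation.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, seed=0):
    """Team-Draft interleaving: in each round, the team with fewer picks
    drafts next (a seeded coin toss breaks ties); the drafting system
    contributes its highest-ranked result not already in the list."""
    rng = random.Random(seed)
    interleaved, team_a, team_b = [], [], []
    while len(interleaved) < length:
        # results from each ranking not yet in the interleaved list
        a_rest = [d for d in ranking_a if d not in interleaved]
        b_rest = [d for d in ranking_b if d not in interleaved]
        if not a_rest and not b_rest:
            break  # both rankings exhausted
        a_drafts = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5)
        if (a_drafts and a_rest) or not b_rest:
            doc = a_rest[0]
            team_a.append(doc)
        else:
            doc = b_rest[0]
            team_b.append(doc)
        interleaved.append(doc)
    return interleaved, team_a, team_b
```

Credit then goes to whichever team contributed the results the user interacted with.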

Page 6

Temporality

Source: https://unsplash.com/photos/ft0-Xu4nTvA

Page 7

Existing Approaches

Balanced, Team-Draft, and Probabilistic interleaving:
• Temporality: not satisfied

Page 8

Verbosity

Source: https://unsplash.com/photos/QlnUpMED6Qs

Page 9

Existing Approaches

Balanced, Team-Draft, and Probabilistic interleaving:
• Temporality: not satisfied
• Verbosity: could be satisfied, but largely biased

Page 10

Source: https://unsplash.com/photos/jqlXLIj3aS8

Redundancy

Page 11

Existing Approaches

Balanced, Team-Draft, and Probabilistic interleaving:
• Temporality: not satisfied
• Verbosity: could be satisfied, but largely biased
• Redundancy: ?

Page 12

Existing Approaches

Balanced, Team-Draft, and Probabilistic interleaving:
• Temporality: not satisfied
• Verbosity: could be satisfied, but largely biased
• Redundancy: satisfied

Page 13

Temporal Interleaving

• Union and sort by document timestamp T
• Prospective notification: T = tweet push time
• E.g., Neu-IR 2016
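A minimal sketch of the temporal interleaving strategy just described, assuming each result is a (tweet_id, timestamp) pair; names and the tuple format are illustrative, not from the paper.

```python
def temporal_interleave(results_a, results_b):
    """Union the two systems' results, sort by timestamp, and collapse
    exact duplicates while remembering which system(s) pushed each tweet.
    For prospective notification, the timestamp is the push time."""
    origin = {}     # tweet_id -> set of contributing systems
    timestamp = {}  # tweet_id -> earliest push time seen
    for system, results in (("A", results_a), ("B", results_b)):
        for tweet_id, ts in results:
            origin.setdefault(tweet_id, set()).add(system)
            # a tweet pushed by both systems keeps its earliest push time
            timestamp[tweet_id] = min(ts, timestamp.get(tweet_id, ts))
    merged = sorted(timestamp, key=lambda t: timestamp[t])
    return [(t, timestamp[t], origin[t]) for t in merged]
```

A tweet pushed by both systems appears once in the merged timeline, with origin {"A", "B"}, which is exactly the bookkeeping the de-duplication example on the next slides relies on.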

Pages 14–15

System A results and System B results (tweet IDs):
742982648566718202
742853770786425799
743582679184237590
743682473083990521
743682506294099968
743682673084772352
743682473083990521

Tweet 743682473083990521 appears in both systems' results: de-duplication matters!

Page 16

User Interaction - Explicit judgments

• Not relevant
• Relevant
• Relevant but redundant

Pages 17–31

Worked example: credit assignment over the interleaved timeline. Each tweet judged relevant earns +1 credit for the system that produced it.

"Masking Effect": a tweet can be judged relevant-but-redundant only because a tweet from the other system appeared above it in the interleaved output, so simply withholding credit would be unfair.

Solution: discount factor = the fraction of relevant or redundant tweets above that come from the other system (in the example, the final redundant tweet earns a discounted credit of +0.66).

Alternative: asking the user for the source of the redundancy.

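The credit assignment with the discount factor can be sketched as below. This is one plausible reading of the heuristic as stated on the slides; the exact bookkeeping in the paper may differ, and all names are mine.

```python
def assign_credit(judged):
    """Heuristic credit assignment over an interleaved timeline.
    `judged` is a list of (system, judgment) tuples in temporal order,
    where judgment is 'relevant', 'redundant', or 'not_relevant'.
    A relevant tweet earns its system +1.  A relevant-but-redundant
    tweet earns a discounted credit: the fraction of relevant or
    redundant tweets above it that came from the *other* system
    (the more the other system caused the redundancy, the more
    credit this system keeps)."""
    credit = {"A": 0.0, "B": 0.0}
    above = []  # (system, judgment) pairs seen so far
    for system, judgment in judged:
        if judgment == "relevant":
            credit[system] += 1.0
        elif judgment == "redundant":
            prior = [s for s, j in above if j in ("relevant", "redundant")]
            if prior:
                from_other = sum(1 for s in prior if s != system)
                credit[system] += from_other / len(prior)
        above.append((system, judgment))
    return credit
```

With three relevant-or-redundant tweets above, two of them from the other system, the discounted credit is 2/3, matching the +0.66 in the example.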
Page 32

Simulation-based Meta-evaluation: Dataset

• Grounded in the Microblog track at TREC
• TREC 2015 real-time filtering task, push notification
• 14 groups, 37 runs
• 1 semantic cluster annotation
• Normalized cumulative gain (nCG): nCG = G/Z, where G is the total gain of the pushed tweets and Z is the maximum possible gain

Example tweets from one semantic cluster (the same event, reported with different wording):
6:28pm At least seven killed in shooting at Sikh temple in Wisconsin
6:10pm >= 7 killed in shooting @ Sikh temple in Wisconsin
6:10pm 4 were shot inside the Sikh Temple of Wisconsin and 3 outside, including a gunman killed by police
Four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police around 6:10pm
6:28pm seven Wisconsin shooting at Sikh temple
6:10pm four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police
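Under the semantic cluster annotation, a cluster-deduplicated reading of nCG can be sketched as follows. This is a simplified sketch with names of my choosing; the official TREC 2015 metric also involves details such as latency discounting that are omitted here.

```python
def ncg(pushed, cluster_of, cluster_gain):
    """Normalized cumulative gain with semantic clusters: only the
    first pushed tweet from each cluster earns that cluster's gain
    (later ones are redundant).  Z, the maximum possible gain, is the
    sum of the gains over all clusters."""
    seen = set()
    gain = 0.0
    for tweet in pushed:
        cluster = cluster_of.get(tweet)
        if cluster is not None and cluster not in seen:
            seen.add(cluster)
            gain += cluster_gain[cluster]
    z = sum(cluster_gain.values())
    return gain / z if z else 0.0
```

Pushing two tweets from the same cluster earns the cluster's gain once, which is how the metric penalizes the redundancy illustrated above.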

Page 33

Simulation-based Meta-evaluation: Experiment Setup

• Apply the temporal interleaving strategy
• Assume a user judging with the semantic cluster annotation
• Compare the difference of assigned credit with the difference of nCG

Page 34

Agreement = correct / (correct + incorrect): a pairwise comparison counts as correct when the interleaved credit difference and the nCG difference prefer the same system.
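The agreement computation can be sketched as below (a minimal sketch; the function name and input format are mine).

```python
def agreement(pairs):
    """`pairs` holds one (credit_a - credit_b, ncg_a - ncg_b) tuple per
    pair of systems compared.  The interleaved evaluation is correct on
    a pair when the sign of its credit difference matches the sign of
    the nCG difference."""
    def sign(x):
        return (x > 0) - (x < 0)
    correct = sum(1 for d_credit, d_ncg in pairs
                  if sign(d_credit) == sign(d_ncg))
    return correct / len(pairs)
```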

Pages 35–36 (results figures)

Page 37

Conclusions

• Effectiveness: 92–93% agreement
• Highest effectiveness: all pairs, binary relevance on "simple task" *
• Effect of using different assessors *
• Cluster importance and cluster weight *
• Effect of "quiet days" and treatment *
• Recall-oriented credit assignment *
• Precision/recall tradeoff *

Page 38

Criticism

• Interleaved evaluation (nearly) doubles the work!

Page 39

Assessor Effort: Output Length

• Solution: flip a biased coin and keep each tweet with probability p
• No extra work, still reasonable accuracy

Page 40

Assessor Effort: Explicit Judgments

• Implicit judgments
• Web search: click models [1, 2], eye-tracking studies [3, 4], etc. [5, 6]
• Multimedia elements, different types of clicks, ...

Sources: [1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. SIGIR, 2006. [2] O. Chapelle and Y. Zhang. A Dynamic Bayesian Network click model for web search ranking. WWW, 2009. [3] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in WWW search. SIGIR, 2004. [4] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS, 25(2):1–27, 2007. [5] D. Kelly. Understanding implicit feedback and document preference: A naturalistic user study. SIGIR Forum, 38(1):77–77, 2004. [6] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18–28, 2003.

Page 41

Assessor Effort: Explicit Judgments

• Explicit judgments: dismiss/click on notifications

Page 42

Assessor Effort: Explicit Judgments

• Solution: assume the user pays attention with probability r
• Good prediction accuracy with limited user interactions
• No extra work, still reasonable accuracy

Page 43

Assessor Effort: Combining Both

• Randomly discarding system output + limited user interactions
• Accuracy and verbosity tradeoff curves
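Both effort-reduction knobs can be combined in a small simulation sketch. Assumptions of mine: independent coin flips per tweet, and `judge` standing in for any user-judgment function.

```python
import random

def simulate_effort(interleaved, judge, p, r, seed=0):
    """Assessor-effort simulation: each interleaved tweet is kept with
    probability p (shorter output), and the user judges a shown tweet
    with probability r (limited attention).  Returns the judged subset
    as (tweet, judgment) pairs."""
    rng = random.Random(seed)
    shown = [t for t in interleaved if rng.random() < p]
    return [(t, judge(t)) for t in shown if rng.random() < r]
```

Sweeping p and r over a grid of values and recomputing agreement at each point yields the accuracy/verbosity tradeoff curves mentioned above.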

Page 44

Summary

• A novel interleaved evaluation methodology
• A temporal interleaving strategy
• A heuristic credit assignment method
• A user interaction model with explicit judgments
• A simulation-based meta-evaluation
• Analysis of assessor effort: output length, explicit judgments

Page 45

Human-in-the-loop assessment (architecture diagram): participant systems and the YoGosling baseline system listen to the Twitter Streaming API and push results to the TREC RTS Server, which delivers the interleaved evaluation to assessors via a mobile app.

Page 46

Questions?