Interleaved Evaluation for Retrospective Summarization
and Prospective Notification on Document Streams
Xin Qian, Jimmy Lin, Adam Roegiest
David R. Cheriton School of Computer Science
University of Waterloo
Monday, July 18, 2016
Motivating example: volunteering at the Rio 2016 Summer Olympics
Source: http://www.phdcomics.com/comics/archive_print.php?comicid=1414
Retrospective summarization (looking back from now): Tweet Timeline Generation, TREC 2014
Prospective notification (looking ahead to the future): Push Notification Scenario, TREC 2015
More than microblogs…
• RSS feeds, social media posts, medical record updates
Evaluation Methodology: A/B Testing
• Controlled experiments via A/B tests
• Random noise between buckets
• Low question bandwidth
Evaluation Methodology: A/B Testing vs. Interleaving
• Interleaving uses a within-subject design
• Better sensitivity
• Faster experiment progress
Interleaved Evaluation Methodology
• How exactly do we interleave the output of two systems into one single output?
  • Balanced interleaving [1]
  • Team-Draft interleaving [2] (see the sketch after this list)
  • Probabilistic interleaving [3]
• How do we assign credit to each system in response to user interactions with the interleaved results?
  • Aggregate over user clicks [1, 2, 3]
  • More sophisticated click aggregation strategies [4, 5, 6]

Source: [1] T. Joachims. Optimizing search engines using clickthrough data. KDD, 2002. [2] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? CIKM, 2008. [3] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. CIKM, 2011. [4] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. TOIS, 2012. [5] F. Radlinski and N. Craswell. Comparing the sensitivity of information retrieval metrics. SIGIR, 2010. [6] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning more powerful test statistics for click-based retrieval evaluation. SIGIR, 2010.
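For context, a minimal sketch of classic Team-Draft interleaving [2] for ranked lists, assuming each run is a plain list of document IDs (the function and variable names here are illustrative, not from the talk):

```python
import random

def team_draft_interleave(run_a, run_b, k=10):
    """Team-Draft interleaving: in each round a coin flip decides
    which system drafts first; each system then contributes its
    highest-ranked result not already selected."""
    interleaved, picks = [], {"A": set(), "B": set()}
    seen = set()
    while len(interleaved) < k:
        order = [("A", run_a), ("B", run_b)]
        random.shuffle(order)  # the coin flip for this round
        progressed = False
        for name, run in order:
            candidate = next((d for d in run if d not in seen), None)
            if candidate is not None and len(interleaved) < k:
                interleaved.append(candidate)
                seen.add(candidate)
                picks[name].add(candidate)
                progressed = True
        if not progressed:  # both runs exhausted
            break
    return interleaved, picks  # picks attribute later clicks to a system
```

Balanced and probabilistic interleaving differ in how the next contribution is chosen, but they expose the same basic interface: two ranked lists in, one interleaved list plus credit attributions out.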
Temporality
Source: https://unsplash.com/photos/ft0-Xu4nTvA
Verbosity
Source: https://unsplash.com/photos/QlnUpMED6Qs
Redundancy
Source: https://unsplash.com/photos/jqlXLIj3aS8
Existing Approaches
For Balanced, Team-Draft, and Probabilistic interleaving alike:
• Temporality: not satisfied
• Verbosity: could be satisfied, but largely biased
• Redundancy: satisfied
Temporal Interleaving
• Union and sort by document timestamp T
• For prospective notifications, T = tweet push time (e.g., Neu-IR 2016)
(a code sketch follows the de-duplication example below)
[Figure: System A and System B result streams, shown as tweet IDs, merged into a single timeline; tweet 743682473083990521 appears in both systems' output.]
De-duplication
[Figure: the same interleaved stream after collapsing the duplicate tweet 743682473083990521 into a single entry.]
De-duplication matters!
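A minimal sketch of the temporal interleaving strategy with de-duplication, assuming each system's output is a list of (tweet_id, timestamp) pairs (all names here are illustrative):

```python
def temporal_interleave(results_a, results_b):
    """Union the two systems' results, collapse duplicates, and sort
    by document timestamp T (for prospective notifications, T is the
    tweet push time)."""
    merged = {}
    for system, results in (("A", results_a), ("B", results_b)):
        for tweet_id, timestamp in results:
            # A tweet pushed by both systems becomes a single entry
            # that remembers both sources.
            entry = merged.setdefault(tweet_id, {"t": timestamp, "systems": set()})
            entry["systems"].add(system)
    return sorted(
        ((tid, e["t"], e["systems"]) for tid, e in merged.items()),
        key=lambda item: item[1],
    )
```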
User Interaction: Explicit Judgments
• Not relevant
• Relevant
• Relevant but redundant
[Figure: credit assignment example. Each tweet the user judges relevant earns +1 for the system that pushed it; not-relevant tweets earn nothing.]
“Masking Effect”
[Figure: one system's relevant tweet earns +1 while the other system's redundant copy earns no credit, even though it would have earned credit on its own.]
• Solution: discount factor = the fraction of relevant or redundant tweets above that come from the other system
• Alternative: ask the user to identify the source of the redundancy
“Masking Effect”
[Figure: worked example of the discount factor. The masked system's redundant tweet receives +0.66 credit (two of the three relevant or redundant tweets above it came from the other system) instead of a full +1.]
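A sketch of this heuristic credit assignment, assuming the interleaved stream is a time-ordered list of (system, judgment) tuples; the names are illustrative, and the example reproduces the +0.66 case above:

```python
def assign_credit(interleaved):
    """+1 for each relevant tweet; for a relevant-but-redundant tweet,
    credit equal to the fraction of relevant or redundant tweets above
    it that came from the other system (the discount factor), which
    compensates for the masking effect."""
    credit = {"A": 0.0, "B": 0.0}
    above = []  # source systems of relevant/redundant tweets seen so far
    for system, judgment in interleaved:
        if judgment == "relevant":
            credit[system] += 1.0
        elif judgment == "redundant" and above:
            credit[system] += sum(s != system for s in above) / len(above)
        if judgment in ("relevant", "redundant"):
            above.append(system)
    return credit

# Two relevant tweets from A, one from B, then a redundant tweet from B:
# B's redundant tweet earns 2/3, i.e. +0.66.
print(assign_credit([("A", "relevant"), ("A", "relevant"),
                     ("B", "relevant"), ("B", "redundant")]))
```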
Simulation-based Meta-evaluation: Dataset
• Grounded in the Microblog track at TREC
• TREC 2015 real-time filtering task, push notification scenario
• 14 groups submitted 37 runs
• Semantic cluster annotations (grouping redundant tweets)
• Normalized cumulative gain (nCG): nCG = (1/Z) × Σ gain(t) over pushed tweets t, where Z is the maximum possible gain (a sketch follows the example below)
Example semantic cluster (redundant tweets about the same event):
• 6:28pm At least seven killed in shooting at Sikh temple in Wisconsin
• 6:10pm >= 7 killed in shooting @ Sikh temple in Wisconsin
• 6:10pm 4 were shot inside the Sikh Temple of Wisconsin and 3 outside, including a gunman killed by police
• Four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police around 6:10pm
• 6:28pm seven Wisconsin shooting at Sikh temple
• 6:10pm four were shot inside the Sikh Temple and 3 outside, including a gunman killed by police
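A minimal sketch of nCG under cluster-based gain, assuming unit gain for the first tweet pushed from each semantic cluster (the track's actual gain values may differ; all names are illustrative):

```python
def ncg(pushed, cluster_of, z_max):
    """Normalized cumulative gain: only the first tweet pushed from
    each semantic cluster earns gain; z_max is Z, the maximum
    possible gain for the topic."""
    seen, gain = set(), 0.0
    for tweet_id in pushed:
        cluster = cluster_of.get(tweet_id)  # None -> not relevant
        if cluster is not None and cluster not in seen:
            seen.add(cluster)
            gain += 1.0  # unit gain per novel relevant tweet (assumption)
    return gain / z_max if z_max > 0 else 0.0
```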
Simulation-based Meta-evaluation: Experiment Setup
• Apply the temporal interleaving strategy
• Assume a simulated user who judges according to the semantic cluster annotations
• Compare the difference in assigned credit against the difference in nCG
Agreement = the fraction of system pairs for which the sign of the credit difference matches the sign of the nCG difference (correct vs. incorrect pairs).
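A minimal sketch of this agreement computation, assuming one (credit difference, nCG difference) tuple per simulated system pair; the tie handling is an assumption:

```python
def agreement(pairs):
    """`pairs` holds (credit_a - credit_b, ncg_a - ncg_b) per system
    pair; agreement is the fraction where interleaved credit and nCG
    prefer the same system."""
    correct = sum(1 for d_credit, d_ncg in pairs
                  if (d_credit > 0) == (d_ncg > 0))
    return correct / len(pairs)
```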
Conclusions
• Effectiveness: 92%–93% agreement
• Highest effectiveness: all pairs, binary relevance on a “simple task” *
• Effect of using different assessors *
• Cluster importance and cluster weight *
• Effect of “quiet days” and their treatment *
• Recall-oriented credit assignment *
• Precision/recall tradeoff *
Criticism
• Interleaved evaluation (nearly) doubles the assessment work!
Assessor Effort: Output Length
• Solution: flip a biased coin and keep each result with probability p
• No extra work for the assessor; accuracy still reasonable
• Implicit judgments instead? In web search: click models [1, 2], eye-tracking studies [3, 4], etc. [5, 6]
• Push notifications differ: multimedia elements, different types of clicks…

Source: [1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. SIGIR, 2006. [2] O. Chapelle and Y. Zhang. A Dynamic Bayesian Network click model for web search ranking. WWW, 2009. [3] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in WWW search. SIGIR, 2004. [4] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS, 25(2):1–27, 2007. [5] D. Kelly. Understanding implicit feedback and document preference: A naturalistic user study. SIGIR Forum, 38(1):77, 2004. [6] D. Kelly and J. Teevan. Implicit feedback for inferring user preference: A bibliography. SIGIR Forum, 37(2):18–28, 2003.
Assessor Effort: Explicit Judgments
• Explicit judgments: dismiss/click on notifications
• Solution: the user pays attention to each notification with probability r
• Good prediction accuracy even with limited user interactions
• No extra work; still reasonable
Assessor Effort: Combining Both
• Randomly discarding system output + limited user interactions (a combined sketch follows below)
• Accuracy vs. verbosity tradeoff curves
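A sketch combining the two effort reductions, assuming a `judge` function that returns the user's judgment for a tweet (all names here are illustrative):

```python
import random

def simulate_assessment(interleaved, judge, p=0.5, r=0.5):
    """Each interleaved result is kept with probability p (shorter
    output) and, if kept, judged with probability r (limited explicit
    judgments); the observed subset feeds credit assignment."""
    observed = []
    for system, tweet in interleaved:
        if random.random() >= p:
            continue  # discarded: never shown to the user
        if random.random() < r:
            observed.append((system, judge(tweet)))
    return observed
```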
Summary
• A novel interleaved evaluation methodology
• A temporal interleaving strategy
• A heuristic credit assignment method
• A user interaction model with explicit judgments
• A simulation-based meta-evaluation
• Analysis of assessor effort: output length, explicit judgments
Human-in-the-loop assessment
[Figure: assessment pipeline. Participant systems and a baseline system (YoGosling) monitor the Twitter Streaming API; their output is sent to the TREC RTS server, interleaved, and pushed to mobile app assessors for evaluation.]
Questions?