Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Tetsuya Sakai (Waseda University)
Lifeng Shang, Zhengdong Lu, Hang Li (Huawei Noah’s Ark Lab)
December 4, 2015 @ AIRS 2015, Brisbane
TALK OUTLINE
1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?
What is STC? (1)
A user’s post on Twitter or Weibo
Note: the real NTCIR-12 STC task deals with Chinese (Weibo) and Japanese (Twitter) only.
What is STC? (2)
Possible responses (comments)
NTCIR-12 STC task definition
Given a new post, can the system return a “good” response by retrieving a comment to an old post from a repository?
(Figure: a repository of old post-comment pairs, plus training data and test data made up of new posts and old comments, with a graded label (L0-L2) for each comment.)
For each new post, retrieve and rank old comments!
NTCIR-12 STC labels: “good” comments?
Comment labellers are instructed to imagine that they are the authors of the new post. Criteria:
(a) Coherent: logically connected with the new post
(b) Topically relevant: the topic matches with that of the new post
(c) Context-independent: good or not does not depend on situations
(d) Non-repetitive: does not just repeat what the new post says
If either (a) or (b) is untrue, label = L0.
Else if either (c) or (d) is untrue, label = L1.
Otherwise (a)-(d) are all true, so label = L2.
Note: this study used “old” labels that are based on an early version of the above criteria.
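The labelling rule above can be sketched as a small function. This is an illustration only: the function name and the boolean parameters are hypothetical, and it follows the criteria as stated on this slide rather than the "old" labels used in the study.

```python
def stc_label(coherent, topically_relevant, context_independent, non_repetitive):
    """Map criteria (a)-(d) to a graded label, following the rule on the slide.

    All four arguments are booleans; the names are hypothetical.
    """
    # (a) or (b) fails: the comment is not a usable response.
    if not (coherent and topically_relevant):
        return "L0"
    # (a) and (b) hold, but (c) or (d) fails: partially good.
    if not (context_independent and non_repetitive):
        return "L1"
    # (a)-(d) all hold.
    return "L2"


print(stc_label(True, True, False, True))  # L1
```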
Evaluation measures for STC
• Navigational search measures
(basically one good comment is enough)
- nG@1 (normalised gain at 1)
- nERR@10 (normalised expected reciprocal rank at 10)
- P+
Example for nG@1 (gain: L2-relevant = 3 points, L1-relevant = 1 point):
Ideal ranked list (ranks 1-4): L2-relevant, L2-relevant, L1-relevant, L1-relevant (gains 3, 3, 1, 1)
System output (ranks 1-4): L1-relevant, Nonrelevant, L2-relevant, Nonrelevant, followed by nonrelevant items (gains 1, 0, 3, 0)
nG@1 = 1/3 for this system output (nG@1 can only be 0 or 1/3 or 1)
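A minimal sketch of nG@1 for the example on this slide, assuming the gain values shown there (L2-relevant = 3 points, L1-relevant = 1 point, nonrelevant = 0). The label strings and the function name are hypothetical.

```python
from fractions import Fraction

# Gain values taken from the slide; "L0" stands for a nonrelevant comment.
GAIN = {"L0": 0, "L1": 1, "L2": 3}


def ng_at_1(system_labels, ideal_labels):
    """Normalised gain at rank 1: gain of the top system item over the top ideal gain."""
    ideal_gain = GAIN[ideal_labels[0]]
    if ideal_gain == 0:
        return Fraction(0)  # no relevant comment exists for this post
    return Fraction(GAIN[system_labels[0]], ideal_gain)


system = ["L1", "L0", "L2", "L0"]
ideal = ["L2", "L2", "L1", "L1"]
print(ng_at_1(system, ideal))  # 1/3
```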
nERR (1) [Chapelle11]
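nERR normalises ERR to ensure the [0,1] range
A Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]: the sum over ranks r of (stopping probability at r) × (utility for users who stop at r)
• Stopping probability at r: the user does not like the items at ranks 1,...,(r-1) and finally likes the item at rank r. This gives a stopping probability distribution over ranks.
• Utility for users who stop at r: 1/r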
nERR (2) [Chapelle11]
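System output (ranks 1-4): L1-relevant, Nonrelevant, L2-relevant, Nonrelevant, followed by nonrelevant items
• Rank 1 (L1-relevant): all users arrive; 1/4 of users stop and 3/4 of users continue
• Rank 2 (Nonrelevant): nobody stops
• Rank 3 (L2-relevant): 3/4 of the arriving users stop and 1/4 continue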
For users that stop at rank 1: 1/4 * 1/1 = 0.2500
For users that stop at rank 3: (3/4 * 3/4) * 1/3 = 0.1875
ERR = 0.2500 + 0.1875 = 0.4375
nERR (3) [Chapelle11]
Ideal ranked list (ranks 1-4): L2-relevant, L2-relevant, L1-relevant, L1-relevant
At an L2-relevant item, 3/4 of the arriving users stop and 1/4 continue; at an L1-relevant item, 1/4 stop and 3/4 continue.
For users that stop at rank 1: 3/4 * 1/1 = 0.7500
For users that stop at rank 2: 1/4 * 3/4 * 1/2 = 0.09375
For users that stop at rank 3: 1/4 * 1/4 * 1/4 * 1/3 = 0.005208
For users that stop at rank 4: 1/4 * 1/4 * 3/4 * 1/4 * 1/4 = 0.00293
ERR* = 0.7500 + 0.09375 + 0.005208 + 0.00293 = 0.8519 (< 1)
nERR (4) [Chapelle11]
(Figure: the ideal ranked list and the system output side by side, with the user flows shown on the two preceding slides.)
ERR = 0.4375
ERR* = 0.8519
nERR = ERR/ERR* = 0.5136
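The worked example can be reproduced with a short sketch. It assumes the stopping probabilities used on the slides (1/4 for L1-relevant, 3/4 for L2-relevant, 0 for nonrelevant); the label strings, function names and the `cutoff` parameter are hypothetical.

```python
from fractions import Fraction

# Stopping probabilities as used in the worked example; "L0" = nonrelevant.
STOP_PROB = {"L0": Fraction(0), "L1": Fraction(1, 4), "L2": Fraction(3, 4)}


def err(labels, cutoff=10):
    """Expected reciprocal rank of a ranked list of graded labels."""
    total = Fraction(0)
    p_reach = Fraction(1)  # share of users who reach the current rank
    for rank, label in enumerate(labels[:cutoff], start=1):
        p_stop = STOP_PROB[label]
        total += p_reach * p_stop / rank  # users stopping here gain utility 1/rank
        p_reach *= 1 - p_stop
    return total


def nerr(system_labels, ideal_labels, cutoff=10):
    """ERR of the system output divided by ERR of the ideal list (ERR*)."""
    ideal = err(ideal_labels, cutoff)
    return err(system_labels, cutoff) / ideal if ideal else Fraction(0)


system = ["L1", "L0", "L2", "L0"]
ideal = ["L2", "L2", "L1", "L1"]
print(float(err(system)))                   # 0.4375
print(round(float(err(ideal)), 4))          # 0.8519
print(round(float(nerr(system, ideal)), 4)) # 0.5136
```

Exact fractions are used so that the intermediate values match the slide arithmetic without rounding drift.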
P+ (1) [Sakai06AIRS,Sakai15book]
A Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]: the sum over ranks r of (stopping probability at r) × (utility for users who stop at r)
• rp: rank of the most relevant item in the list that is nearest to the top
• Stopping probability at r: uniform user distribution over the relevant docs at or above rp
• Utility for users who stop at r: blended ratio BR(r) (cf. Q-measure), which combines precision and normalised cumulative gain (nCG)
P+ (2) [Sakai06AIRS,Sakai15book]
System output (ranks 1-4): L1-relevant (1 point), Nonrelevant, L2-relevant (3 points), Nonrelevant, followed by nonrelevant items
rp = 3: the most relevant item in the list, nearest to the top. No user will go beyond rp.
50% of users stop at rank 1 and 50% of users stop at rank 3.
Ideal ranked list (ranks 1-4): L2-relevant, L2-relevant, L1-relevant, L1-relevant (gains 3, 3, 1, 1 points)
P+ (3) [Sakai06AIRS,Sakai15book]
BR(1) = (1 + 1)/(1 + 3) = 0.5
(The first term of the numerator, 1, is the number of relevant items found so far.)
P+ (4) [Sakai06AIRS,Sakai15book]
BR(3) = (2 + 4)/(3 + 7) = 0.6
(The first term of the numerator, 2, is the number of relevant items found so far.)
P+ (5) [Sakai06AIRS,Sakai15book]
BR(1) = (1 + 1)/(1 + 3) = 0.5
BR(3) = (2 + 4)/(3 + 7) = 0.6
P+ = (BR(1) + BR(3))/ 2 = 0.5500
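A sketch that reproduces this P+ example. It assumes the gain values from the slides (L2 = 3, L1 = 1, nonrelevant = 0) and a blending parameter of 1, which is what the BR(1) and BR(3) arithmetic above implies; the names `p_plus` and `beta` are hypothetical.

```python
# Gain values from the slides; "L0" = nonrelevant.
GAIN = {"L0": 0, "L1": 1, "L2": 3}


def p_plus(system_labels, ideal_labels, beta=1.0):
    """P+: average blended ratio over the relevant items at or above rank rp."""
    gains = [GAIN[label] for label in system_labels]
    ideal_gains = [GAIN[label] for label in ideal_labels]
    if not gains or max(gains) == 0:
        return 0.0  # no relevant item retrieved
    # rp: rank of the most relevant retrieved item that is nearest to the top.
    rp = gains.index(max(gains)) + 1

    found = 0         # number of relevant items found so far
    cg = 0            # cumulative gain of the system output
    cg_ideal = 0      # cumulative gain of the ideal list
    total = 0.0
    for r in range(1, rp + 1):
        cg += gains[r - 1]
        if r <= len(ideal_gains):
            cg_ideal += ideal_gains[r - 1]
        if gains[r - 1] > 0:
            found += 1
            total += (found + beta * cg) / (r + beta * cg_ideal)  # BR(r)
    # Users are spread uniformly over the relevant items at or above rp.
    return total / found


system = ["L1", "L0", "L2", "L0"]
ideal = ["L2", "L2", "L1", "L1"]
print(round(p_plus(system, ideal), 4))  # 0.55
```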
Topic set size design [Sakai15IRJ]
http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf (open access!)
• Uses sample size design techniques [Nagata03] to determine the number of topics for a new test collection
• The present study uses the one-way ANOVA-based method for comparing m systems.
• When m=2, the result is equivalent to that of the t-test-based method.
• When m=10, the result is equivalent to requiring an upper bound (δ) on the confidence interval (CI) width for the difference between two systems.
Input to topic set size design
• α: Probability of Type I error (detecting a nonexistent difference)
• β: Probability of Type II error (missing a real difference)
• m: number of systems to be compared with one-way ANOVA
• minD: minimum detectable range. We want to ensure 100(1-β)% power whenever the true difference between the best and the worst system (D) is minD or larger.
• σ²: variance common across all systems
(Figure: D is the gap between the mean of the best system and the mean of the worst system; the means of the other systems fall in between. Each observation is the score for System i, Topic j.)
How to estimate the common variance for each evaluation measure
• Given an nC×mC topic-by-run score matrix for measure M, the within-system variance can be estimated as the residual variance from one-way ANOVA.
• Given multiple matrices, a more reliable estimate can be obtained as a pooled variance. Pooling was not applied in this study because multiple matrices were not available.
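As an illustration of the within-system variance estimate, the sketch below computes the one-way ANOVA residual variance from a topic-by-run matrix: the squared deviations of each run's scores from that run's own mean, summed over all runs and divided by m(n-1). The function name and the tiny matrix are hypothetical, not data from the study.

```python
def within_system_variance(matrix):
    """One-way ANOVA residual variance of a topic-by-run score matrix.

    matrix[j][i] is the score of run i on topic j (n topics by m runs).
    """
    n = len(matrix)       # number of topics
    m = len(matrix[0])    # number of runs
    if n < 2:
        raise ValueError("need at least two topics")
    residual_ss = 0.0
    for i in range(m):
        scores = [row[i] for row in matrix]
        mean = sum(scores) / n
        residual_ss += sum((x - mean) ** 2 for x in scores)
    # Residual degrees of freedom: m * (n - 1).
    return residual_ss / (m * (n - 1))


# Hypothetical 3 topics x 2 runs matrix.
scores = [
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 1.0],
]
print(within_system_variance(scores))  # 0.125
```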
Excel sheet for (α, β) = (0.05, 0.20)
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
Enter minD, m and the variance estimate; the sheet shows the required topic set size (n=98 in the example shown).
Three topic set size design methods [Sakai15IRJ]
• ANOVA with m=2 is equivalent to the t-test
• ANOVA with m=10, minD=c is equivalent to CI with δ=c
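To give a feel for how (α, β, minD) and the variance interact in the two-system case, here is a rough sketch based on a textbook normal approximation for a two-sided paired comparison. It is not the ANOVA-based procedure or the spreadsheet used in the study. It assumes that the variance of the per-topic score difference is twice the common variance, and the variance value in the usage line is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_topic_set_size(alpha, beta, min_d, var):
    """Approximate number of topics for comparing two systems.

    alpha: Type I error probability (two-sided)
    beta:  Type II error probability (power = 1 - beta)
    min_d: smallest true difference we want to detect
    var:   common per-system variance (assumed known)
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(1 - beta)
    var_diff = 2 * var  # assumption: the two systems' scores are independent
    # Normal approximation plus the usual small-sample correction term.
    n = (z_alpha + z_beta) ** 2 * var_diff / min_d ** 2 + z_alpha ** 2 / 2
    return ceil(n)


# Hypothetical variance of 0.05 with (alpha, beta, minD) = (0.05, 0.20, 0.10).
print(approx_topic_set_size(0.05, 0.20, 0.10, 0.05))
```

A smaller minD or a larger variance drives the required number of topics up quickly, since n grows with var / minD².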
Creating the topic-by-run matrices for obtaining the variance estimates
• nC = 225 training topics (posts) from the STC data
• mC = 6 pilot runs (tuned with training data)
(Figure labels: training data of new posts and old comments; graded label (L0-L2) for each comment; mean over 225 topics.)
Features used in the pilot runs
http://arxiv.org/pdf/1408.6988.pdf
• Q2P: similarity between new post (query) and old post. Vector space model.
• Q2C: similarity between new post (query) and old comment. Vector space model.
• TransLM: translation-based language model for bridging the gap between new post and old post-comment.
• TopicWord: topic word model for estimating the probability that each word in an old post/comment is related to the main topic. Logistic regression.
P-values and effect sizes (ES_HSD) [Sakai15forum]
• P-values are based on the randomised Tukey HSD test
• ES_HSD: the difference between two systems’ mean scores, standardised using the residual variance from two-way ANOVA without replication
Main result (α=0.05, β=0.20)
(Table annotations: m = #systems compared; minD = diff between best and worst systems.)
If we create n=100 topics, these requirements will be satisfied.
Topic set size decision: n=100 for test topics! (1)
• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.10, 2). If the true difference between the two systems is 0.10 or larger, 80% power is guaranteed.
• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.15, 10). If the true difference between the best and the worst among m=10 systems is 0.15 or larger, 80% power is guaranteed. The CI width for the difference between any system pair is expected to be δ=0.15 or smaller.
Topic set size decision: n=100 for test topics! (2)
• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 50). If the true difference between the best and the worst among m=50 systems is 0.20 or larger, 80% power is guaranteed.
• Using nG@1, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 5). If the true difference between the best and the worst among m=5 systems is 0.20 or larger, 80% power is guaranteed.
Summary
• STC is the first evaluation task to use topic set size design!
• Based on the results, we decided to have n=100 topics. The important point is that we KNOW what this means from a statistical viewpoint: (α, β, minD, m) or (α, β, δ), where minD is the difference in population means of the best and worst systems, and δ is the CI width for the difference between two systems.
• We can obtain more reliable topic-by-run matrices from the official STC-1 results. Better variance estimates can be obtained for designing the STC-2 collection.
Test collections should keep improving! (1)
(Figure: what we did in this study)
• Training topics + pilot runs → matrices in this study
• Estimate within-system variances and determine topic set sizes → n topics
• STC-1 test topics + STC-1 runs → matrices from official STC-1
Test collections should keep improving! (2)
(Figure)
• Training topics + pilot runs → matrices in this study
• STC-1 test topics + STC-1 runs → matrices from official STC-1
• Estimate within-system variances and adjust topic set sizes → n’ topics (pooling variances for higher accuracy)
• STC-2 test topics + STC-2 runs → matrices from official STC-2
Interested in STC?
You can still join the Japanese subtask!
Chinese
Japanese