Topic Set Size Design with the Evaluation Measures for Short Text Conversation Tetsuya Sakai (Waseda University) Lifeng Shang, Zhengdong Lu, Hang Li (Huawei Noah’s Ark Lab) [email protected] December 4, 2015@AIRS 2015, Brisbane


Page 1: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Tetsuya Sakai (Waseda University)

Lifeng Shang, Zhengdong Lu, Hang Li (Huawei Noah’s Ark Lab)

[email protected]

December 4, 2015@AIRS 2015, Brisbane

Page 2: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

Page 3: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

What is STC? (1)

A user’s post on Twitter or Weibo

Note: the real NTCIR-12 STC task deals with Chinese (Weibo) and Japanese (Twitter) only.

Page 4: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

What is STC? (2)

Possible responses (comments)

Page 5: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

NTCIR-12 STC task definition

Given a new post, can the system return a “good” response by retrieving a comment to an old post from a repository?

[Figure: three panels labelled Repository (pairs of old post and old comment), Training data (new posts with old comments) and Test data (new posts).]

For each new post, retrieve and rank old comments!

Graded label (L0-L2) for each comment

Page 6: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

NTCIR-12 STC labels: “good” comments?

Comment labellers are instructed to imagine that they are the authors of the new post. Criteria:

(a) Coherent: logically connected with the new post

(b) Topically relevant: the topic matches with that of the new post

(c) Context-independent: good or not does not depend on situations

(d) Non-repetitive: does not just repeat what the new post says

If either (a) or (b) is untrue, label = L0.

Else if either (c) or (d) is untrue, label = L1.

Otherwise (a)-(d) are all true, so label = L2.

Note: this study used “old” labels that are based on an early version of the above criteria.
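The labelling rule above can be written as a small decision function. This is only a sketch: the function name `stc_label` and the argument names are illustrative and are not taken from the NTCIR-12 guidelines.

```python
def stc_label(coherent: bool, topically_relevant: bool,
              context_independent: bool, non_repetitive: bool) -> str:
    """Map the four criteria (a)-(d) to a graded label L0, L1 or L2."""
    # If either (a) or (b) is untrue, the comment is L0.
    if not (coherent and topically_relevant):
        return "L0"
    # Else if either (c) or (d) is untrue, the comment is L1.
    if not (context_independent and non_repetitive):
        return "L1"
    # Otherwise (a)-(d) are all true.
    return "L2"
```

For example, a coherent, on-topic comment that merely repeats the new post gets `stc_label(True, True, True, False)`, which is L1.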

Page 7: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

Page 8: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Evaluation measures for STC

• Navigational search measures

(basically one good comment is enough)

- nG@1 (normalised gain at 1)

- nERR@10 (normalised expected reciprocal rank at 10)

- P+

[Figure: a new post as input and a ranked output of old comments at ranks 1-4, some of which are L1-relevant or L2-relevant.]

Page 9: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Ideal ranked list (gain): rank 1: L2-relevant (3 points); rank 2: L2-relevant (3 points); rank 3: L1-relevant (1 point); rank 4: L1-relevant (1 point)

System output (gain): rank 1: L1-relevant (1 point); rank 2: Nonrelevant; rank 3: L2-relevant (3 points); rank 4: Nonrelevant; ...; rank k: Nonrelevant

nG@1 = 1/3

nG@1 = 0 or 1/3 or 1
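A minimal sketch of nG@1 for the example above. The gain values (1 point for L1, 3 points for L2) follow the slide; the function name `ng_at_1` and the label strings are assumptions made for illustration.

```python
# Gain values as in the example: L1 = 1 point, L2 = 3 points.
GAIN = {"L0": 0, "L1": 1, "L2": 3}

def ng_at_1(system_labels, all_labels):
    """nG@1: gain of the system's top item divided by the gain of the
    top item of an ideal ranked list.

    system_labels: labels of the system output, in rank order.
    all_labels: labels of all judged comments for the new post.
    """
    ideal_top = max((GAIN[label] for label in all_labels), default=0)
    if not system_labels or ideal_top == 0:
        return 0.0
    return GAIN[system_labels[0]] / ideal_top

score = ng_at_1(["L1", "L0", "L2", "L0"], ["L2", "L2", "L1", "L1"])
```

With an L2-relevant comment available, the only possible values are 0, 1/3 and 1, which is why nG@1 is a coarse measure.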

Page 10: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

nERR (1) [Chapelle11]

nERR normalises ERR to ensure the [0,1] range

Stopping probability over ranks

User does not like items at ranks 1,...,(r-1). Finally likes the item at rank r

Stopping probability at r

Utility for users who stop at r

A Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]
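The formula itself did not survive in the text of this slide. The following is a reconstruction from the annotations above and from the worked example on the nERR (2)-(4) slides, where p(r) is the probability that a user likes the item at rank r (1/4 for L1 and 3/4 for L2 in that example). The symbol names are assumptions.

```latex
\mathrm{ERR} = \sum_{r}
  \underbrace{\Bigl(\prod_{k=1}^{r-1} \bigl(1 - p(k)\bigr)\Bigr)\, p(r)}_{\text{stopping probability at } r}
  \cdot
  \underbrace{\frac{1}{r}}_{\text{utility for users who stop at } r},
\qquad
\mathrm{nERR} = \frac{\mathrm{ERR}}{\mathrm{ERR}^{*}}
```

Here ERR* denotes the ERR of an ideal ranked list.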

Page 11: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

nERR (2) [Chapelle11]

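System output: rank 1: L1-relevant; rank 2: Nonrelevant; rank 3: L2-relevant; rank 4: Nonrelevant; ...; rank k: Nonrelevant

All users examine rank 1: 1/4 of users stop there and 3/4 of users continue. Nobody stops at the nonrelevant rank 2, so 3/4 of users reach rank 3, where 3/4 of them stop and 1/4 continue.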

For users that stop at rank 1: 1/4 * 1/1 = 0.2500

For users that stop at rank 3: (3/4 * 3/4) * 1/3 = 0.1875

ERR = 0.2500 + 0.1875 = 0.4375

Page 12: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

nERR (3) [Chapelle11]

Ideal ranked list: rank 1: L2-relevant; rank 2: L2-relevant; rank 3: L1-relevant; rank 4: L1-relevant

All users examine rank 1. At each L2-relevant item, 3/4 of the users who reach it stop and 1/4 continue; at each L1-relevant item, 1/4 stop and 3/4 continue.

Rank 1: 3/4 * 1/1 = 0.7500

Rank 2: 1/4 * 3/4 * 1/2 = 0.09375

Rank 3: 1/4 * 1/4 * 1/4 * 1/3 = 0.005208

Rank 4: 1/4 * 1/4 * 3/4 * 1/4 * 1/4 = 0.00293

ERR* = 0.7500 + 0.09375 + 0.005208 + 0.00293 = 0.8519 (< 1)

Page 13: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

nERR (4) [Chapelle11]

[Figure: the ideal ranked list from nERR (3) and the system output from nERR (2), side by side, with the same fractions of users stopping and continuing at each rank.]

ERR = 0.4375

ERR* = 0.8519

nERR = ERR/ERR* = 0.5136
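The worked example can be reproduced with a short script. This is a sketch under stated assumptions: the stopping probabilities (1/4 for L1, 3/4 for L2) are read off the example, and the names `STOP_PROB`, `err` and `nerr` are illustrative. Exact fractions are used so that the result can be compared with the slide values.

```python
from fractions import Fraction

# Probability that a user stops at an item with the given label
# (values taken from the worked example; the mapping is an assumption).
STOP_PROB = {"L0": Fraction(0), "L1": Fraction(1, 4), "L2": Fraction(3, 4)}

def err(labels, cutoff=10):
    """Expected reciprocal rank of a ranked list of labels."""
    total = Fraction(0)
    still_reading = Fraction(1)  # fraction of users who reach the current rank
    for rank, label in enumerate(labels[:cutoff], start=1):
        p = STOP_PROB[label]
        total += still_reading * p * Fraction(1, rank)  # stop here, utility 1/rank
        still_reading *= 1 - p                          # the rest move on
    return total

def nerr(system_labels, all_labels, cutoff=10):
    """ERR of the system output divided by the ERR of an ideal list."""
    ideal = sorted(all_labels, key=lambda label: STOP_PROB[label], reverse=True)
    ideal_err = err(ideal, cutoff)
    return err(system_labels, cutoff) / ideal_err if ideal_err else Fraction(0)

system = ["L1", "L0", "L2", "L0"]
judged = ["L2", "L2", "L1", "L1"]
print(float(err(system)), float(nerr(system, judged)))
```

The ideal list is built by sorting all judged labels, so ERR* is below 1 whenever the top item is not liked with certainty, as on the nERR (3) slide.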

Page 14: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

P+ (1) [Sakai06AIRS,Sakai15book]

A Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]

Stopping probability at r: uniform user distribution over relevant docs at or above rp

Utility for users who stop at r: blended ratio (cf. Q-measure), which combines precision and normalised cumulative gain (nCG)

rp: rank of the most relevant item in the list, nearest to the top
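The formula is missing from the text of this slide. The following reconstruction is consistent with the worked values on the P+ (3)-(5) slides, for example BR(1) = (1 + 1)/(1 + 3). The notation is assumed: I(r) is 1 if the item at rank r is relevant and 0 otherwise, C(r) is the number of relevant items in the top r, cg(r) and cg*(r) are the cumulative gains of the system output and of the ideal ranked list, and β = 1 in the example.

```latex
\mathrm{BR}(r) = \frac{C(r) + \beta\, cg(r)}{r + \beta\, cg^{*}(r)},
\qquad
P^{+} = \frac{1}{C(r_p)} \sum_{r=1}^{r_p} I(r)\, \mathrm{BR}(r)
```

If the list contains no relevant item, P+ is taken to be 0.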

Page 15: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

P+ (2) [Sakai06AIRS,Sakai15book]

System output (gain): rank 1: L1-relevant (1 point); rank 2: Nonrelevant; rank 3: L2-relevant (3 points); rank 4: Nonrelevant; ...; rank k: Nonrelevant

rp: the most relevant item in the list, nearest to the top (rank 3 here). No user will go beyond rp. 50% of users stop at rank 1 and 50% of users stop at rank 3.

Ideal ranked list (gain): rank 1: L2-relevant (3 points); rank 2: L2-relevant (3 points); rank 3: L1-relevant (1 point); rank 4: L1-relevant (1 point)

Page 16: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

P+ (3) [Sakai06AIRS,Sakai15book]

[Figure: same system output and ideal ranked list as in P+ (2).]

BR(1) = (1 + 1)/(1 + 3) = 0.5

The first term of the numerator (1) is the number of relevant items found so far.

Page 17: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

P+ (4) [Sakai06AIRS,Sakai15book]

[Figure: same system output and ideal ranked list as in P+ (2).]

BR(3) = (2 + 4)/(3 + 7) = 0.6

The first term of the numerator (2) is the number of relevant items found so far.

Page 18: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

P+ (5) [Sakai06AIRS,Sakai15book]

[Figure: same system output and ideal ranked list as in P+ (2).]

BR(3) = (2 + 4)/(3 + 7) = 0.6

BR(1) = (1 + 1)/(1 + 3) = 0.5

P+ = (BR(1) + BR(3))/ 2 = 0.5500
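A minimal sketch that reproduces the P+ example. The gain values follow the slides; the function name `p_plus`, the label strings and the default β = 1 are assumptions for illustration.

```python
# Gain values as in the example: L1 = 1 point, L2 = 3 points.
GAIN = {"L0": 0, "L1": 1, "L2": 3}

def p_plus(system_labels, all_labels, beta=1.0):
    """P+: average blended ratio over the relevant items at or above r_p."""
    gains = [GAIN[label] for label in system_labels]
    if not gains or max(gains) == 0:
        return 0.0  # no relevant item in the list
    ideal = sorted((GAIN[label] for label in all_labels), reverse=True)
    # r_p: rank of the most relevant item in the list, nearest to the top.
    r_p = gains.index(max(gains)) + 1
    count = 0      # number of relevant items found so far
    cg = 0         # cumulative gain of the system output
    cg_ideal = 0   # cumulative gain of the ideal ranked list
    total = 0.0
    for r in range(1, r_p + 1):
        gain = gains[r - 1]
        cg += gain
        cg_ideal += ideal[r - 1] if r <= len(ideal) else 0
        if gain > 0:
            count += 1
            total += (count + beta * cg) / (r + beta * cg_ideal)  # BR(r)
    return total / count

value = p_plus(["L1", "L0", "L2", "L0"], ["L2", "L2", "L1", "L1"])
```

Because users are assumed never to go beyond rp, anything ranked below the first most-relevant item does not affect the score.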

Page 19: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

Page 20: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Topic set size design [Sakai15IRJ]

http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf

• Uses sample size design techniques [Nagata03] to determine the number of topics for a new test collection

• The present study uses the one-way ANOVA-based method for comparing m systems.

• When m=2, the result is equivalent to that of the t-test-based method.

• When m=10, the result is equivalent to requiring an upper bound (δ) on the width of the confidence interval (CI) for the difference between two systems.

open access!

Page 21: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Input to topic set size design

• α: Probability of Type I error (detecting a nonexistent difference)

• β: Probability of Type II error (missing a real difference)

• m: number of systems to be compared with one-way ANOVA

• minD: minimum detectable range – we want to ensure 100(1-β)% power whenever the true difference between the best and the worst system (D) is minD or larger.

• σ²: variance common across all systems

[Figure: the mean of the best system and the mean of the worst system are D apart, with the means of the other systems in between. Also shown: the score for System i, Topic j.]
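As a sketch of how these inputs fit together, the requirement can be stated with the standard power analysis for one-way ANOVA. This is a reconstruction rather than a formula taken from the slides, and the symbols φ_A, φ_E and λ are notation assumed here. With n topics and m systems, the least favourable case has the best and worst systems exactly minD apart and all other means midway between them. The required topic set size is the smallest n such that:

```latex
\Pr\bigl\{ F'(\phi_A, \phi_E, \lambda) \ge F_{\alpha}(\phi_A, \phi_E) \bigr\} \ge 1 - \beta,
\qquad
\phi_A = m - 1,\quad
\phi_E = m(n - 1),\quad
\lambda = \frac{n \cdot \mathit{minD}^{2}}{2\sigma^{2}}
```

Here F' follows a noncentral F distribution with noncentrality parameter λ, and F_α is the upper 100α% point of the central F distribution. A larger variance or a smaller minD lowers λ for a given n, so more topics are needed.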

Page 22: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

How to estimate the common variance for each evaluation measure

Given an nC×mC topic-by-run score matrix for an evaluation measure M, the within-system variance can be estimated as the residual variance from one-way ANOVA.

Given multiple matrices, a more reliable estimate can be obtained as a pooled variance. Multiple matrices were not available in this study, so pooling was not applied.
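A minimal sketch of the two estimates. The first function computes the one-way ANOVA residual variance of a topic-by-run matrix. The second pools several estimates weighted by (number of topics - 1), which is an assumed weighting and is not given in the text of this slide. The function names are illustrative.

```python
from statistics import variance

def within_system_variance(matrix):
    """Residual variance from one-way ANOVA.

    matrix[j][i] is the score of run i on topic j (n topics, m runs).
    """
    n, m = len(matrix), len(matrix[0])
    columns = [[row[i] for row in matrix] for i in range(m)]
    # Squared deviations from each run's own mean, summed over all runs.
    sum_of_squares = sum(variance(column) * (n - 1) for column in columns)
    return sum_of_squares / (m * (n - 1))  # m(n-1) degrees of freedom

def pooled_variance(estimates):
    """Pool (n_topics, variance) pairs, weighting each by n_topics - 1 (assumed)."""
    numerator = sum((n - 1) * v for n, v in estimates)
    denominator = sum(n - 1 for n, _ in estimates)
    return numerator / denominator

# Toy matrix: 3 topics (rows) by 2 runs (columns); not real STC data.
toy = [[0.5, 0.4], [0.7, 0.5], [0.9, 0.6]]
estimate = within_system_variance(toy)
```

In this study only the single 225×6 matrix was available, so only the first function would apply.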

Page 23: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx

Enter minD, m and the variance estimate

Required topic set size n=98

Excel sheet for (α, β) = (0.05, 0.20)

Page 24: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Three topic set size design methods [Sakai15IRJ]

ANOVA with m=2 is equivalent to the t-test

ANOVA with m=10, minD=c is equivalent to CI with δ=c

Page 25: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

Page 26: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Creating the topic-by-run matrices for obtaining the variance estimates

• nC = 225 training topics (posts) from the STC data

• mC = 6 pilot runs (tuned with training data)

[Figure: training data consisting of new posts and old comments, with a graded label (L0-L2) for each comment.]

[Table: mean over 225 topics]

Page 27: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Features used in the pilot runs

http://arxiv.org/pdf/1408.6988.pdf

• Q2P: similarity between new post (query) and old post. Vector space model.

• Q2C: similarity between new post (query) and old comment. Vector space model.

• TransLM: translation-based language model for bridging the gap between new post and old post-comment.

• TopicWord: topic word model for estimating the probability that each word in an old post/comment is related to the main topic. Logistic regression.

[Figure: the repository of old post and old comment pairs, and training data pairing a new post with an old comment.]

Page 28: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

P-values and effect sizes (ES_HSD) [Sakai15forum]

P-values based on randomised Tukey HSD

ES_HSD: effect size based on the residual variance from two-way ANOVA without replication
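The ES_HSD formula is missing from the text of this slide. The sketch below assumes that ES_HSD is the difference between two system means divided by the square root of the residual variance from two-way ANOVA without replication, as the annotation suggests. The function names and the toy matrix are illustrative.

```python
from math import sqrt

def residual_variance_two_way(matrix):
    """Residual variance from two-way ANOVA without replication.

    matrix[j][i] is the score of system i on topic j (n topics, m systems).
    """
    n, m = len(matrix), len(matrix[0])
    grand_mean = sum(sum(row) for row in matrix) / (n * m)
    topic_means = [sum(row) / m for row in matrix]
    system_means = [sum(row[i] for row in matrix) / n for i in range(m)]
    # Remove both the topic effect and the system effect from each score.
    sum_of_squares = sum(
        (matrix[j][i] - topic_means[j] - system_means[i] + grand_mean) ** 2
        for j in range(n) for i in range(m)
    )
    return sum_of_squares / ((n - 1) * (m - 1))

def es_hsd(matrix, i, k):
    """Assumed effect size: mean difference over the residual standard deviation."""
    n = len(matrix)
    mean_i = sum(row[i] for row in matrix) / n
    mean_k = sum(row[k] for row in matrix) / n
    return (mean_i - mean_k) / sqrt(residual_variance_two_way(matrix))

# Toy matrix: 3 topics (rows) by 2 systems (columns); not real STC data.
toy = [[0.5, 0.4], [0.7, 0.5], [0.9, 0.6]]
effect = es_hsd(toy, 0, 1)
```

Removing the topic effect makes the residual variance smaller than the one-way ANOVA estimate on the same matrix, because differences in topic difficulty are no longer counted as noise.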

Page 29: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Main result (α=0.05, β=0.20)

[Table: topic set sizes, annotated with the number of systems compared (m) and the difference between the best and worst systems (minD).]

If we create n=100 topics, these requirements will be satisfied.

Page 30: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Topic set size decision: n=100 for test topics! (1)

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.10, 2).

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.15, 10). The width of the CI for the difference between any system pair is expected to be δ=0.15 or smaller.

If the true difference between the two systems is 0.10 or larger, 80% power is guaranteed

If the true difference between the best and the worst among m=10 systems is 0.15 or larger, 80% power is guaranteed

Page 31: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Topic set size decision: n=100 for test topics! (2)

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 50).

• Using nG@1, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 5).

If the true difference between the best and the worst among m=50 systems is 0.20 or larger, 80% power is guaranteed

If the true difference between the best and the worst among m=5 systems is 0.20 or larger, 80% power is guaranteed

Page 32: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

Page 33: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Summary

• STC is the first evaluation task to use topic set size design!

• Based on the results, we decided to have n=100 topics. The important point is that we KNOW what this means from a statistical viewpoint:

(α, β, minD, m) or (α, β, δ)

• We can obtain more reliable topic-by-run matrices from the official STC-1 results. Better variance estimates can be obtained for designing the STC-2 collection.

minD: difference in population means of best and worst systems

δ: CI width for the difference between two systems

Page 34: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Test collections should keep improving! (1)

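[Figure: training topics and pilot runs yield the matrices in this study, which are used to estimate within-system variances and determine topic set sizes (n topics) for the STC-1 test topics. This is what we did in this study. The STC-1 test topics and STC-1 runs yield the matrices from official STC-1.]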

Page 35: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Test collections should keep improving! (2)

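[Figure: the same pipeline extended to STC-2. The matrices in this study (training topics, pilot runs) and the matrices from official STC-1 (STC-1 test topics, STC-1 runs) are used to estimate within-system variances and adjust topic set sizes, pooling variances for higher accuracy, which gives n’ topics for the STC-2 test topics. The STC-2 test topics and STC-2 runs yield the matrices from official STC-2.]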

Page 36: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

Page 37: Topic Set Size Design with the Evaluation Measures for Short Text Conversation

You can still join the Japanese subtask!

Chinese

Japanese