Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Tetsuya Sakai (Waseda University)
Lifeng Shang, Zhengdong Lu, Hang Li (Huawei Noah’s Ark Lab)

[email protected]

December 4, 2015 @ AIRS 2015, Brisbane

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

What is STC? (1)

A user’s post on Twitter or Weibo

Note: the real NTCIR-12 STC task deals with Chinese (Weibo) and Japanese (Twitter) only.

What is STC? (2)

Possible responses (comments)

NTCIR-12 STC task definition

Given a new post, can the system return a “good” response by retrieving a comment to an old post from a repository?

[Figure: a repository of old post / old comment pairs, plus training data and test data that each consist of new posts. Each retrieved comment receives a graded label (L0-L2).]

For each new post, retrieve and rank old comments!

NTCIR-12 STC labels: “good” comments?

Comment labellers are instructed to imagine that they are the authors of the new post. Criteria:

(a) Coherent: logically connected with the new post

(b) Topically relevant: the topic matches with that of the new post

(c) Context-independent: whether the comment is good does not depend on the situation

(d) Non-repetitive: does not just repeat what the new post says

If either (a) or (b) is untrue, label = L0.

Else if either (c) or (d) is untrue, label = L1.

Otherwise (a)-(d) are all true, so label = L2.

Note: this study used “old” labels that are based on an early version of the above criteria.
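The labelling rule above can be written as a small function. This is a minimal sketch: the function name `stc_label` and its boolean parameters are hypothetical, and it encodes only the three rules stated on this slide (not the "old" labels actually used in this study).

```python
def stc_label(coherent: bool, topically_relevant: bool,
              context_independent: bool, non_repetitive: bool) -> int:
    """Return the graded label (0, 1 or 2) for one comment.

    Hypothetical helper: the four flags correspond to criteria (a)-(d).
    """
    # If either (a) or (b) is untrue, the label is L0.
    if not (coherent and topically_relevant):
        return 0
    # Else if either (c) or (d) is untrue, the label is L1.
    if not (context_independent and non_repetitive):
        return 1
    # Otherwise (a)-(d) are all true, so the label is L2.
    return 2
```

For example, a comment that is coherent and on topic but merely repeats the post would get `stc_label(True, True, True, False)`, i.e. L1.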


Evaluation measures for STC

• Navigational search measures

(basically one good comment is enough)

- nG@1 (normalised gain at 1)

- nERR@10 (normalised expected reciprocal rank at 10)

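- P+

[Figure: the input is a new post; the ranked output is a list of old comments at ranks 1, 2, 3, 4, ...]

nG@1

[Figure: gain values are 3 points for an L2-relevant comment and 1 point for an L1-relevant comment. Ideal ranked list: L2-relevant, L2-relevant, L1-relevant, L1-relevant at ranks 1-4 (gains 3, 3, 1, 1). System output: L1-relevant (1 point) at rank 1, nonrelevant at rank 2, L2-relevant (3 points) at rank 3, and nonrelevant from rank 4 down to rank k.]

For this system output, nG@1 = 1/3 (gain 1 at rank 1, divided by the ideal gain 3 at rank 1).

In general, nG@1 = 0 or 1/3 or 1.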
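The nG@1 example can be reproduced in a few lines. This is a minimal sketch: the names `GAIN` and `ng_at_1` are hypothetical, and the gain values (3 points for L2, 1 point for L1) are the ones shown on the slide.

```python
# Gain value for each graded label, as on the slide: L2 = 3 points, L1 = 1 point.
GAIN = {0: 0, 1: 1, 2: 3}

def ng_at_1(system_labels, judged_labels):
    """Normalised gain at rank 1.

    system_labels: labels of the system output, in rank order.
    judged_labels: labels of all judged comments for this post
                   (used to find the gain at rank 1 of the ideal list).
    """
    ideal_gain = max((GAIN[l] for l in judged_labels), default=0)
    if ideal_gain == 0 or not system_labels:
        return 0.0
    return GAIN[system_labels[0]] / ideal_gain

# System output: L1, nonrelevant, L2, nonrelevant; ideal list: L2, L2, L1, L1.
score = ng_at_1([1, 0, 2, 0], [2, 2, 1, 1])
```

When at least one L2-relevant comment exists, the only possible values are 0, 1/3 and 1, which is why nG@1 is a coarse measure.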

nERR (1) [Chapelle11]

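nERR normalises ERR to ensure the [0,1] range: nERR = ERR/ERR*, where ERR* is the ERR of an ideal ranked list.

ERR is a Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]: a sum over ranks of (stopping probability at r) × (utility for users who stop at r).

$$\mathrm{ERR} = \sum_{r} \left( \prod_{k=1}^{r-1} (1 - p_k) \right) p_r \cdot \frac{1}{r}$$

• Stopping probability at r: the user does not like the items at ranks 1,...,(r-1) and finally likes the item at rank r. Here p_r is the probability that the user likes the item at rank r.

• Utility for users who stop at r: 1/r.

In the worked example in this talk, p = 1/4 for an L1-relevant comment, 3/4 for an L2-relevant comment, and 0 for a nonrelevant comment.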


[Figure: system output with L1-relevant at rank 1, nonrelevant at rank 2, L2-relevant at rank 3, and nonrelevant from rank 4 down to rank k. Of all users, 1/4 stop at rank 1 and 3/4 continue; of the users who reach rank 3, 3/4 stop there and 1/4 continue.]

For users that stop at rank 1: 1/4 * 1/1 = 0.2500

For users that stop at rank 3: (3/4 * 3/4) * 1/3 = 0.1875

ERR = 0.2500 + 0.1875 = 0.4375


[Figure: ideal ranked list with L2-relevant at ranks 1 and 2 and L1-relevant at ranks 3 and 4. At each L2-relevant comment, 3/4 of the arriving users stop and 1/4 continue; at each L1-relevant comment, 1/4 stop and 3/4 continue.]

For users that stop at rank 1: 3/4 * 1/1 = 0.7500

For users that stop at rank 2: 1/4 * 3/4 * 1/2 = 0.09375

For users that stop at rank 3: 1/4 * 1/4 * 1/4 * 1/3 = 0.005208

For users that stop at rank 4: 1/4 * 1/4 * 3/4 * 1/4 * 1/4 = 0.00293

ERR* = 0.7500 + 0.09375 + 0.005208 + 0.00293 = 0.8519 (< 1)


[Figure: the system output (L1-relevant, nonrelevant, L2-relevant, nonrelevant, ...) and the ideal ranked list (L2-relevant, L2-relevant, L1-relevant, L1-relevant) side by side, with the same proportions of users stopping and continuing at each rank.]

ERR = 0.4375

ERR* = 0.8519

nERR = ERR/ERR* = 0.5136
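The ERR, ERR* and nERR values above can be checked with a short script. This is a minimal sketch: the function names are hypothetical, and the stopping probability (2^label - 1) / 2^max_label is an assumption chosen because it yields the 1/4 (L1) and 3/4 (L2) values used on the slides.

```python
def stop_prob(label, max_label=2):
    """Probability that a user is satisfied by a comment with this label.

    Assumed mapping: (2^label - 1) / 2^max_label, giving 0, 1/4, 3/4.
    """
    return (2 ** label - 1) / 2 ** max_label

def err(labels, cutoff=10):
    """Expected reciprocal rank of a ranked list of graded labels."""
    total = 0.0
    reach_prob = 1.0  # probability that a user reaches the current rank
    for rank, label in enumerate(labels[:cutoff], start=1):
        p = stop_prob(label)
        total += reach_prob * p / rank  # stop here with utility 1/rank
        reach_prob *= 1.0 - p           # otherwise continue down the list
    return total

def nerr(system_labels, judged_labels, cutoff=10):
    """ERR of the system output divided by ERR of the ideal ranked list."""
    ideal = sorted(judged_labels, reverse=True)
    ideal_err = err(ideal, cutoff)
    return err(system_labels, cutoff) / ideal_err if ideal_err > 0 else 0.0

system = [1, 0, 2, 0]   # L1, nonrelevant, L2, nonrelevant
judged = [2, 2, 1, 1]   # labels of all relevant comments for this post
err_system = err(system)
err_ideal = err(sorted(judged, reverse=True))
nerr_value = nerr(system, judged)
```

Because ERR* is below 1 even for a perfect list, dividing by it is what brings the best achievable score up to exactly 1.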

P+ (1) [Sakai06AIRS,Sakai15book]

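P+ is also a Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]: a sum over ranks of (stopping probability at r) × (utility for users who stop at r).

• Stopping probability at r: uniform user distribution over the relevant docs at or above rp, where rp is the rank of the most relevant item in the list that is nearest to the top.

• Utility for users who stop at r: the blended ratio BR(r) (cf. Q-measure), which combines precision and normalised cumulative gain (nCG).

$$P^{+} = \frac{1}{C(r_p)} \sum_{r=1}^{r_p} I(r)\,\mathrm{BR}(r), \qquad \mathrm{BR}(r) = \frac{C(r) + \sum_{k=1}^{r} g(k)}{r + \sum_{k=1}^{r} g^{*}(k)}$$

Here I(r) is 1 if the item at rank r is relevant and 0 otherwise, C(r) is the number of relevant items found so far (at ranks 1,...,r), g(k) is the gain at rank k of the system output, and g*(k) is the gain at rank k of the ideal ranked list. BR is shown with its blending parameter set to 1, which matches the BR values computed in this talk.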


[Figure: system output with L1-relevant (1 point) at rank 1, nonrelevant at rank 2, L2-relevant (3 points) at rank 3, and nonrelevant from rank 4 down to rank k; ideal ranked list with L2-relevant at ranks 1 and 2 (3 points each) and L1-relevant at ranks 3 and 4 (1 point each). rp = 3 is the rank of the most relevant item in the list, nearest to the top. No user will go beyond rp: 50% of users stop at rank 1 and 50% of users stop at rank 3.]

BR(1) = (1 + 1)/(1 + 3) = 0.5 (the first term of the numerator is #relevant found so far)

BR(3) = (2 + 4)/(3 + 7) = 0.6

P+ = (BR(1) + BR(3))/2 = 0.5500
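The P+ computation can be traced in code as follows. This is a minimal sketch: `GAIN`, `p_plus` and the argument names are hypothetical, the gain values are the slide's (3 points for L2, 1 point for L1), and the blended ratio is taken in the form used by the BR(1) and BR(3) lines above.

```python
# Gain value for each graded label, as on the slides.
GAIN = {0: 0, 1: 1, 2: 3}

def p_plus(system_labels, judged_labels):
    """P+ for one ranked list of graded labels.

    system_labels: labels of the system output, in rank order.
    judged_labels: labels of all judged comments (defines the ideal list).
    """
    if not any(label > 0 for label in system_labels):
        return 0.0  # no relevant comment retrieved
    # rp: rank of the most relevant item in the list, nearest to the top.
    rp = system_labels.index(max(system_labels)) + 1
    ideal_gains = sorted((GAIN[l] for l in judged_labels), reverse=True)

    found = 0        # number of relevant items found so far
    cg = 0           # cumulative gain of the system output
    cg_ideal = 0     # cumulative gain of the ideal ranked list
    br_sum = 0.0
    for r in range(1, rp + 1):
        label = system_labels[r - 1]
        cg += GAIN[label]
        if r <= len(ideal_gains):
            cg_ideal += ideal_gains[r - 1]
        if label > 0:
            found += 1
            br_sum += (found + cg) / (r + cg_ideal)  # blended ratio BR(r)
    # Users are uniformly distributed over the relevant items at or above rp.
    return br_sum / found

p_plus_value = p_plus([1, 0, 2, 0], [2, 2, 1, 1])
```

Ranks below rp never contribute, so swapping or adding items after the first L2-relevant comment leaves P+ unchanged.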


Topic set size design [Sakai15IRJ]

http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf (open access!)

• Uses sample size design techniques [Nagata03] to determine the number of topics for a new test collection.

• The present study uses the one-way ANOVA-based method for comparing m systems.

• When m=2, the result is equivalent to that of the t-test-based method.

• When m=10, the result is equivalent to requiring an upper bound (δ) on the confidence interval (CI) width for the difference between two systems.

Input to topic set size design

• α: Probability of Type I error (detecting a nonexistent difference)

• β: Probability of Type II error (missing a real difference)

• m: number of systems to be compared with one-way ANOVA

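• minD: minimum detectable range. We want to ensure 100(1-β)% power whenever the true difference D between the best and the worst system is minD or larger.

• σ²: variance common across all systems

[Figure: population means of the m systems on a line. D is the distance between the mean of the best system and the mean of the worst system, with the means of the other systems in between. A further annotation marks the score for System i, Topic j.]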
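To show how these inputs interact, here is a rough sketch for the m=2 case only. It is not the ANOVA-based method used in this study: it uses the textbook normal approximation to the paired t-test sample size, and it assumes that the variance of the per-topic score difference is 2σ². The function name and the example variance 0.05 are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def approx_topic_set_size(alpha, beta, min_d, sigma2):
    """Approximate number of topics for comparing two systems.

    Normal approximation (not the exact t- or F-based computation):
        n >= (z_{1-alpha/2} + z_{1-beta})^2 * var_diff / min_d^2
    with var_diff = 2 * sigma2 (assumed variance of score differences).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance criterion
    z_beta = z.inv_cdf(1 - beta)        # power requirement
    var_diff = 2 * sigma2
    return ceil((z_alpha + z_beta) ** 2 * var_diff / min_d ** 2)

# Hypothetical variance estimate 0.05; alpha, beta and minD as in this talk.
n_topics = approx_topic_set_size(alpha=0.05, beta=0.20, min_d=0.10, sigma2=0.05)
```

The required n grows linearly with the variance and with 1/minD², so halving minD roughly quadruples the number of topics needed.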

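How to estimate the common variance for each evaluation measure

Given an n_C × m_C topic-by-run score matrix C for a measure M, the within-system variance can be estimated as the residual variance from one-way ANOVA:

$$\hat{\sigma}_C^2 = V_E = \frac{\sum_{i=1}^{m_C} \sum_{j=1}^{n_C} (x_{ij} - \bar{x}_{i\bullet})^2}{m_C (n_C - 1)}$$

where x_ij is the score for System i, Topic j, and x̄_i• is the mean score of System i over the topics.

Given multiple matrices, a more reliable estimate can be obtained as a pooled variance:

$$\hat{\sigma}^2 = \frac{\sum_C (n_C - 1)\,\hat{\sigma}_C^2}{\sum_C (n_C - 1)}$$

The pooled variance was not applied in this study, because multiple matrices were not available.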
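The variance estimation can be sketched in plain Python. This is an illustration under assumptions: the function names and the toy matrix are hypothetical, the within-system variance is computed as the one-way ANOVA residual variance, and pooling weights each estimate by its number of topics minus one.

```python
def within_system_variance(matrix):
    """One-way ANOVA residual variance of a topic-by-run score matrix.

    matrix[j][i] is the score of run i on topic j (n topics by m runs).
    """
    n = len(matrix)       # number of topics
    m = len(matrix[0])    # number of runs
    run_means = [sum(row[i] for row in matrix) / n for i in range(m)]
    # Sum of squared deviations of each score from its own run's mean.
    ss_within = sum((row[i] - run_means[i]) ** 2
                    for row in matrix for i in range(m))
    return ss_within / (m * (n - 1))

def pooled_variance(estimates):
    """Pool variance estimates from several matrices.

    estimates: list of (number_of_topics, variance_estimate) pairs.
    """
    numerator = sum((n - 1) * var for n, var in estimates)
    denominator = sum(n - 1 for n, _ in estimates)
    return numerator / denominator

# Toy matrix (made-up scores): 3 topics by 2 runs.
toy = [[0.0, 1.0],
       [1.0, 1.0],
       [2.0, 1.0]]
toy_variance = within_system_variance(toy)
toy_pooled = pooled_variance([(3, toy_variance), (5, 0.1)])
```

In the setting of this talk, the matrix would have 225 rows (training topics) and 6 columns (pilot runs), one matrix per evaluation measure.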

Excel sheet for (α, β) = (0.05, 0.20): http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx

Enter minD, m and σ̂².

[Screenshot: the sheet reports the required topic set size, n=98 in the example shown.]

Three topic set size design methods [Sakai15IRJ]

• ANOVA with m=2 is equivalent to the t-test.

• ANOVA with m=10, minD=c is equivalent to CI with δ=c.


Creating the topic-by-run matrices for obtaining σ̂²’s

• nC = 225 training topics (posts) from the STC data

• mC = 6 pilot runs (tuned with training data)

[Figure: training data consisting of new posts and their old comments, with a graded label (L0-L2) for each comment. The reported scores are means over 225 topics.]

Features used in the pilot runs

http://arxiv.org/pdf/1408.6988.pdf

• Q2P: similarity between new post (query) and old post. Vector space model.

• Q2C: similarity between new post (query) and old comment. Vector space model.

• TransLM: translation-based language model for bridging the gap between new post and old post-comment.

• TopicWord: topic word model for estimating the probability that each word in an old post/comment is related to the main topic. Logistic regression.







P-values and effect sizes (ESHSD) [Sakai15forum]

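P-values are based on the randomised Tukey HSD test.

$$\mathrm{ES}_{HSD}(i, i') = \frac{\bar{x}_{i\bullet} - \bar{x}_{i'\bullet}}{\sqrt{V_E}}$$

where x̄_i• is the mean score of System i over the topics and V_E is the residual variance from two-way ANOVA without replication.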
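The effect size can be computed from a topic-by-run matrix as sketched below. This is an illustration under assumptions: the function names and the toy matrix are hypothetical, and the effect size is taken to be the difference between two run means divided by the square root of the residual variance from two-way ANOVA without replication. The randomised Tukey HSD p-values are not covered here.

```python
from math import sqrt

def residual_variance_two_way(matrix):
    """Residual variance from two-way ANOVA without replication.

    matrix[j][i] is the score of run i on topic j (n topics by m runs).
    """
    n, m = len(matrix), len(matrix[0])
    grand_mean = sum(sum(row) for row in matrix) / (n * m)
    topic_means = [sum(row) / m for row in matrix]
    run_means = [sum(row[i] for row in matrix) / n for i in range(m)]
    # Residual: what is left after removing the run and topic effects.
    ss_residual = sum(
        (matrix[j][i] - run_means[i] - topic_means[j] + grand_mean) ** 2
        for j in range(n) for i in range(m))
    return ss_residual / ((m - 1) * (n - 1))

def es_hsd(matrix, run_a, run_b):
    """Effect size for the difference between two runs (column indices)."""
    n = len(matrix)
    mean_a = sum(row[run_a] for row in matrix) / n
    mean_b = sum(row[run_b] for row in matrix) / n
    return (mean_a - mean_b) / sqrt(residual_variance_two_way(matrix))

# Toy matrix (made-up scores): 3 topics by 2 runs.
toy = [[1.0, 2.0],
       [2.0, 4.0],
       [3.0, 3.0]]
toy_residual = residual_variance_two_way(toy)
toy_effect = es_hsd(toy, 1, 0)
```

Unlike the one-way estimate used for topic set size design, this residual variance removes the topic effect, so it is typically smaller.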

Main result (α=0.05, β=0.20)

[Table: required topic set sizes, organised by the number of systems compared (m) and the difference between the best and worst systems (minD).]

If we create n=100 topics, these requirements will be satisfied.

Topic set size decision: n=100 for test topics! (1)

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.10, 2). If the true difference between the two systems is 0.10 or larger, 80% power is guaranteed.

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.15, 10). If the true difference between the best and the worst among m=10 systems is 0.15 or larger, 80% power is guaranteed. The CI width for the difference between any system pair is expected to be δ=0.15 or smaller.

Topic set size decision: n=100 for test topics! (2)

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 50).

• Using nG@1, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 5).




Summary

• STC is the first evaluation task to use topic set size design!

• Based on the results, we decided to have n=100 topics. The important point is that we KNOW what this means from a statistical viewpoint:

(α, β, minD, m) or (α, β, δ), where minD refers to the difference in population means of the best and worst systems, and δ is the CI width for the difference between two systems.

• We can obtain more reliable topic-by-run matrices from the official STC-1 results. Better σ̂²’s can be obtained for designing the STC-2 collection.

Test collections should keep improving! (1)

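[Figure: flow diagram. Training topics and pilot runs give the matrices in this study. These are used to estimate within-system variances and determine topic set sizes (n topics) for the STC-1 test topics; this is what we did in this study. STC-1 test topics and STC-1 runs then give the matrices from official STC-1.]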

Test collections should keep improving! (2)

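[Figure: flow diagram extended to STC-2. Training topics and pilot runs give the matrices in this study; STC-1 test topics and STC-1 runs give further matrices. Pooling variances for higher accuracy, these are used to estimate within-system variances and adjust topic set sizes (n’ topics) for the STC-2 test topics and STC-2 runs.]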


You can still join the Japanese subtask!

Chinese

Japanese

Technology
