Topic Set Size Design with the Evaluation Measures for Short Text Conversation

Tetsuya Sakai (Waseda University)

Lifeng Shang, Zhengdong Lu, Hang Li (Huawei Noah’s Ark Lab)

tetsuyasakai@acm.org

December 4, 2015 @ AIRS 2015, Brisbane

TALK OUTLINE

1. Short Text Conversation
2. Evaluation measures for STC
3. Topic Set Size Design
4. Experiments for STC
5. Summary
6. Interested in STC?

What is STC? (1)

A user’s post on Twitter or Weibo

Note: the real NTCIR-12 STC task deals with Chinese (Weibo) and Japanese (Twitter) only.

What is STC? (2)

Possible responses (comments)

NTCIR-12 STC task definition

Given a new post, can the system return a “good” response by retrieving a comment to an old post from a repository?

[Figure: the repository holds old post-comment pairs. The training data and test data consist of new posts, each paired with old comments that carry a graded label (L0-L2).]

For each new post, retrieve and rank old comments!

NTCIR-12 STC labels: “good” comments?

Comment labellers are instructed to imagine that they are the authors of the new post. Criteria:

(a) Coherent: logically connected with the new post

(b) Topically relevant: the topic matches with that of the new post

(c) Context-independent: whether the comment is good does not depend on the situation

(d) Non-repetitive: does not just repeat what the new post says

If either (a) or (b) is untrue, label = L0.

Else if either (c) or (d) is untrue, label = L1.

Otherwise (a)-(d) are all true, so label = L2.
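The labelling rule above can be sketched as a small function. This is only an illustration of the decision order; the function name `stc_label` and its boolean parameters are hypothetical and not part of the task definition.

```python
def stc_label(coherent: bool, topically_relevant: bool,
              context_independent: bool, non_repetitive: bool) -> str:
    """Map the four criteria (a)-(d) to a graded label L0-L2."""
    if not (coherent and topically_relevant):
        return "L0"  # (a) or (b) is untrue
    if not (context_independent and non_repetitive):
        return "L1"  # (c) or (d) is untrue
    return "L2"      # (a)-(d) are all true


# A coherent, on-topic comment that merely repeats the post gets L1.
print(stc_label(True, True, True, False))
```

Checking (a) and (b) first means that a comment failing either of them is L0 regardless of (c) and (d).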

Note: this study used “old” labels that are based on an early version of the above criteria.


Evaluation measures for STC

• Navigational search measures

(basically one good comment is enough)

- nG@1 (normalised gain at 1)

- nERR@10 (normalised expected reciprocal rank at 10)

- P+

[Figure: the input is a new post; the output is a ranked list of old comments.]

Example for nG@1 (gain: L2-relevant = 3 points, L1-relevant = 1 point)

Ideal ranked list: an L2-relevant comment at the top (3 points)

System output: L1-relevant, Nonrelevant, L2-relevant, Nonrelevant, ..., Nonrelevant

nG@1 = 1/3

nG@1 = 0 or 1/3 or 1
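A minimal sketch of nG@1 for the example above. The gain values (L2 = 3 points, L1 = 1 point) come from the slide; the names `GAIN` and `ng_at_1` are hypothetical, and the ideal list is assumed to start with an L2-relevant comment.

```python
# Gain values from the slide; L0 (nonrelevant) is assumed to have zero gain.
GAIN = {"L0": 0, "L1": 1, "L2": 3}


def ng_at_1(system_labels, ideal_labels):
    """Gain of the top-ranked system item divided by the gain of the top ideal item."""
    return GAIN[system_labels[0]] / GAIN[ideal_labels[0]]


system = ["L1", "L0", "L2", "L0", "L0"]  # system output in the example
ideal = ["L2", "L1"]                     # assumed: best available comment is L2

print(ng_at_1(system, ideal))
```

With these three label levels the measure can only take the values 0, 1/3 or 1, as the slide notes.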

nERR (1) [Chapelle11]

nERR normalises ERR to ensure the [0,1] range.

ERR is a Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]: the sum over ranks r of (stopping probability at r) × (utility for users who stop at r).

• Stopping probability at r: the user does not like the items at ranks 1,...,(r-1), and finally likes the item at rank r.

• Utility for users who stop at r: 1/r.

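Example: ERR for the system output (an L1-relevant comment satisfies 1/4 of the users who reach it, an L2-relevant comment 3/4)

System output: L1-relevant, Nonrelevant, L2-relevant, Nonrelevant, ..., Nonrelevant

Rank 1 (L1-relevant): 1/4 of all users stop here; 3/4 of users continue.

Rank 3 (L2-relevant): 3/4 of the remaining users stop here; 1/4 of them continue.

For users that stop at rank 1: 1/4 * 1/1 = 0.2500

For users that stop at rank 3: (3/4 * 3/4) * 1/3 = 0.1875

ERR = 0.2500 + 0.1875 = 0.4375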

Example: ERR* for the ideal ranked list

Ideal ranked list: L2-relevant, L2-relevant, L1-relevant, L1-relevant

Rank 1: 3/4 * 1/1 = 0.7500

Rank 2: 1/4 * 3/4 * 1/2 = 0.09375

Rank 3: 1/4 * 1/4 * 1/4 * 1/3 = 0.005208

Rank 4: 1/4 * 1/4 * 3/4 * 1/4 * 1/4 = 0.00293

ERR* = 0.7500 + 0.09375 + 0.005208 + 0.00293 = 0.8519 (< 1)


ERR = 0.4375

ERR* = 0.8519

nERR = ERR/ERR* = 0.5136
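The arithmetic above can be reproduced with a short sketch. The stopping probabilities (L1 = 1/4, L2 = 3/4) and the utility 1/r are read off the worked example; the names `STOP_PROB`, `err` and `nerr` are hypothetical, and the ideal list (two L2-relevant comments followed by two L1-relevant ones) is inferred from the ERR* terms.

```python
# Probability that a user who reaches an item is satisfied by it (from the example).
STOP_PROB = {"L0": 0.0, "L1": 0.25, "L2": 0.75}


def err(labels, cutoff=10):
    """Expected reciprocal rank: sum over ranks of P(stop at r) * 1/r."""
    score = 0.0
    p_reach = 1.0  # fraction of users who reach the current rank
    for rank, label in enumerate(labels[:cutoff], start=1):
        p_stop = STOP_PROB[label]
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score


def nerr(system_labels, ideal_labels, cutoff=10):
    """ERR of the system output divided by ERR of the ideal list."""
    return err(system_labels, cutoff) / err(ideal_labels, cutoff)


system = ["L1", "L0", "L2", "L0", "L0"]
ideal = ["L2", "L2", "L1", "L1"]

print(err(system))                  # 0.4375
print(round(err(ideal), 4))         # 0.8519
print(round(nerr(system, ideal), 4))  # 0.5136
```

Tracking the fraction of users who reach each rank avoids recomputing the product of continuation probabilities at every rank.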

P+ (1) [Sakai06AIRS,Sakai15book]

P+ is a Normalised Cumulative Utility measure [Sakai/Robertson08EVIA]: the sum over ranks r of (stopping probability at r) × (utility for users who stop at r).

• Stopping probability at r: uniform user distribution over the relevant docs at or above rp, where rp is the rank of the most relevant item in the list, nearest to the top.

• Utility for users who stop at r: the blended ratio BR(r) (cf. Q-measure), which combines precision and normalised cumulative gain (nCG).

Example: P+ for the system output (gain: L2-relevant = 3 points, L1-relevant = 1 point)

System output: L1-relevant, Nonrelevant, L2-relevant, Nonrelevant, ..., Nonrelevant

Ideal ranked list: L2-relevant, L2-relevant, L1-relevant, ...

rp = 3: the L2-relevant comment at rank 3 is the most relevant in the list, nearest to the top. No user will go beyond rp, so 50% of users stop at rank 1 and 50% of users stop at rank 3.

BR(r) = (#relevant found so far + cumulative gain of the system output at r) / (r + cumulative gain of the ideal list at r)

BR(1) = (1 + 1)/(1 + 3) = 0.5

BR(3) = (2 + 4)/(3 + 7) = 0.6

P+ = (BR(1) + BR(3))/2 = 0.5500
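A minimal sketch that reproduces the P+ example. The gain values and the blended-ratio arithmetic follow the slide; the names `GAIN` and `p_plus` are hypothetical, the weight `beta=1.0` on the cumulative gains is an assumption that matches the numbers shown, and the ideal list (L2, L2, L1, L1) is inferred from the denominators.

```python
GAIN = {"L0": 0, "L1": 1, "L2": 3}


def p_plus(system_labels, ideal_labels, beta=1.0):
    """Average the blended ratio BR(r) over relevant items at or above rp."""
    gains = [GAIN[label] for label in system_labels]
    if max(gains, default=0) == 0:
        return 0.0  # no relevant item in the list
    # rp: rank of the most relevant item in the list, nearest to the top
    rp = gains.index(max(gains)) + 1
    ideal_gains = [GAIN[label] for label in ideal_labels]

    rel_count = 0   # number of relevant items found so far
    cg = 0          # cumulative gain of the system output
    cg_ideal = 0    # cumulative gain of the ideal list
    total = 0.0
    for r in range(1, rp + 1):
        cg += gains[r - 1]
        if r <= len(ideal_gains):
            cg_ideal += ideal_gains[r - 1]
        if gains[r - 1] > 0:
            rel_count += 1
            total += (rel_count + beta * cg) / (r + beta * cg_ideal)
    return total / rel_count


system = ["L1", "L0", "L2", "L0", "L0"]
ideal = ["L2", "L2", "L1", "L1"]
print(round(p_plus(system, ideal), 4))  # 0.55
```

Stopping the loop at rp reflects the assumption that no user goes beyond rp, so relevant items below it do not contribute.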


Topic set size design [Sakai15IRJ]

http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf (open access!)

• Uses sample size design techniques [Nagata03] to determine the number of topics for a new test collection.

• The present study uses the one-way ANOVA-based method for comparing m systems.

• When m=2, the result is equivalent to that of the t-test-based method.

• When m=10, the result is equivalent to requiring an upper bound (δ) on the width of the confidence interval (CI) for the difference between two systems.

Input to topic set size design

• α: Probability of Type I error (detecting a nonexistent difference)

• β: Probability of Type II error (missing a real difference)

• m: number of systems to be compared with one-way ANOVA

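• minD: minimum detectable range. We want to ensure 100(1-β)% power whenever the true difference between the best and the worst system (D) is minD or larger.

• σ²: variance common across all systems

[Figure: the score for System i, Topic j is modelled with the system's population mean and the common variance σ². D is the difference between the mean of the best system and the mean of the worst system; the means of the other systems lie in between.]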
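To give a feel for how α, β, minD and the variance interact, here is a rough sketch for the two-system case. It uses the textbook normal approximation for a two-sided test, not the ANOVA-based procedure of [Sakai15IRJ], so its output will not match the Excel tool exactly. The function name `approx_topic_set_size` and the variance value 0.05 are hypothetical, and the variance of a score difference is assumed to be twice the common variance.

```python
import math
from statistics import NormalDist


def approx_topic_set_size(min_d, variance, alpha=0.05, beta=0.20):
    """Normal-approximation sample size for detecting a difference of min_d.

    Assumption: the per-topic score difference between two systems has
    variance 2 * variance (common variance, independent scores).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for the target power
    var_diff = 2 * variance
    return math.ceil((z_alpha + z_beta) ** 2 * var_diff / min_d ** 2)


# Hypothetical variance of 0.05; halving minD roughly quadruples the topics needed.
print(approx_topic_set_size(0.20, 0.05))
print(approx_topic_set_size(0.10, 0.05))
```

The sketch makes the main trade-off visible: the required number of topics grows with the variance and with the inverse square of minD.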

How to estimate the common variance for each evaluation measure

Given an nC×mC topic-by-run score matrix for evaluation measure M, the within-system variance can be estimated as the residual variance from one-way ANOVA.

Given multiple matrices, a more reliable estimate can be obtained as a pooled variance. Multiple matrices were not available in this study, so pooling was not applied.
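A minimal sketch of the variance estimate, assuming the standard one-way ANOVA residual mean square: the squared deviations of each run's scores from that run's mean, summed and divided by mC(nC-1). The pooling rule, which weights each matrix by (nC-1), is likewise an assumption, and the function names and toy matrix are hypothetical.

```python
def within_system_variance(matrix):
    """matrix[j][i] is the score of run i on topic j (n topics x m runs)."""
    n = len(matrix)
    m = len(matrix[0])
    ss = 0.0
    for i in range(m):
        scores = [row[i] for row in matrix]
        mean = sum(scores) / n
        ss += sum((x - mean) ** 2 for x in scores)
    return ss / (m * (n - 1))  # residual mean square of one-way ANOVA


def pooled_variance(matrices):
    """Pool per-matrix estimates, weighting each by (number of topics - 1)."""
    num = sum((len(mat) - 1) * within_system_variance(mat) for mat in matrices)
    den = sum(len(mat) - 1 for mat in matrices)
    return num / den


# Toy matrix: 3 topics x 2 runs.
toy = [[0.2, 0.4],
       [0.4, 0.8],
       [0.6, 0.6]]
print(round(within_system_variance(toy), 4))  # 0.04
```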

Excel sheet for (α, β) = (0.05, 0.20):

http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx

[Screenshot: after entering minD, m and σ², the sheet reports the required topic set size (n=98 in this example).]

Three topic set size design methods [Sakai15IRJ]

• ANOVA with m=2 is equivalent to the t-test-based method.

• ANOVA with m=10, minD=c is equivalent to the CI-based method with δ=c.


Creating the topic-by-run matrices for obtaining σ²’s

• nC = 225 training topics (posts) from the STC data

• mC = 6 pilot runs (tuned with training data)

[Figure: the training data consist of new posts paired with old comments, with a graded label (L0-L2) for each comment. Scores are reported as the mean over 225 topics.]

Features used in the pilot runs

http://arxiv.org/pdf/1408.6988.pdf

• Q2P: similarity between new post (query) and old post. Vector space model.

• Q2C: similarity between new post (query) and old comment. Vector space model.

• TransLM: translation-based language model for bridging the gap between new post and old post-comment.

• TopicWord: topic word model for estimating the probability that each word in an old post/comment is related to the main topic. Logistic regression.

[Figure: the repository holds old post-comment pairs; the training data hold new posts paired with old comments.]

P-values and effect sizes (ES_HSD) [Sakai15forum]

• P-values are based on the randomised Tukey HSD test.

• ES_HSD is computed using the residual variance from two-way ANOVA without replication.
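A minimal sketch of the effect size, assuming ES_HSD is the difference between two systems' mean scores divided by the square root of the residual variance from two-way ANOVA without replication. That reading of the slide is an assumption, as are the function names and the toy matrix.

```python
import math


def two_way_residual_variance(matrix):
    """matrix[j][i] is the score of run i on topic j; returns (V_E2, run means)."""
    n = len(matrix)
    m = len(matrix[0])
    grand = sum(sum(row) for row in matrix) / (n * m)
    topic_means = [sum(row) / m for row in matrix]
    run_means = [sum(row[i] for row in matrix) / n for i in range(m)]
    # Residual: what is left after removing the run effect and the topic effect.
    ss = sum((matrix[j][i] - run_means[i] - topic_means[j] + grand) ** 2
             for j in range(n) for i in range(m))
    return ss / ((m - 1) * (n - 1)), run_means


def es_hsd(matrix, i, k):
    """Standardised mean difference between runs i and k (assumed definition)."""
    v_e2, run_means = two_way_residual_variance(matrix)
    return (run_means[i] - run_means[k]) / math.sqrt(v_e2)


# Toy matrix: 3 topics x 2 runs.
toy = [[0.2, 0.4],
       [0.4, 0.8],
       [0.6, 0.6]]
print(round(es_hsd(toy, 1, 0), 4))  # 1.4142
```

The sketch does not guard against a zero residual variance, which would occur if the runs differed by a constant on every topic.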

Main result (α=0.05, β=0.20)

[Table: required topic set sizes by m (#systems compared) and minD (difference between best and worst systems); values not recoverable.]

If we create n=100 topics, these requirements will be satisfied.

Topic set size decision: n=100 for test topics! (1)

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.10, 2). If the true difference between the two systems is 0.10 or larger, 80% power is guaranteed.

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.15, 10). If the true difference between the best and the worst among m=10 systems is 0.15 or larger, 80% power is guaranteed. The CI width for the difference between any system pair is expected to be δ=0.15 or smaller.

Topic set size decision: n=100 for test topics! (2)

• Using P+ or nERR@10, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 50).

• Using nG@1, our test set will achieve (α, β, minD, m) = (0.05, 0.20, 0.20, 5).


Summary

• STC is the first evaluation task to use topic set size design!

• Based on the results, we decided to have n=100 topics. The important point is that we KNOW what this means from a statistical viewpoint:

(α, β, minD, m) or (α, β, δ)

Here minD is the difference in population means of the best and worst systems, and δ is the CI width for the difference between two systems.

• We can obtain more reliable topic-by-run matrices from the official STC-1 results. Better σ²’s can be obtained for designing the STC-2 collection.

Test collections should keep improving! (1)

[Figure: what we did in this study. The training topics and pilot runs give the matrices in this study, from which we estimate within-system variances and determine topic set sizes (n topics). The STC-1 test topics and STC-1 runs will then give matrices from the official STC-1.]

Test collections should keep improving! (2)

[Figure: the matrices in this study (training topics, pilot runs) and the matrices from STC-1 (STC-1 test topics, STC-1 runs) are used to estimate within-system variances and adjust topic set sizes, pooling variances for higher accuracy. This gives n’ topics for the STC-2 test topics and STC-2 runs.]


You can still join the Japanese subtask!

Chinese

Japanese
