
Results of the 2000 Topic Detection and Tracking Evaluation in Mandarin and English

Jonathan Fiscus and George Doddington

What’s New in TDT 2000

• TDT3 corpus used in both 1999 and 2000
– 120 topics used in the 2000 test: 60 1999 topics, 60 new topics

• Of 44K news stories, 24% were at least singly judged YES or BRIEF

• 1999 and 2000 topics are very different in terms of size and cross-language makeup

– Annotation of new topics used search engine-guided annotation: "Use a search engine with on-topic story feedback and interactive searching techniques to limit the number of stories read by annotators"

• Evaluation Protocol Changes
– Only minor changes to Tracking (negative example stories)
– Link Detection test set selection changed in light of last year's experience

[Bar chart: count of on-topic stories (0–10,000) for the TDT3 topics, the 1999 topic set, and the 2000 topic set, split by Mandarin and English]

Search-Guided Annotation: How will it affect scores?

• Simulate search-guided annotation using 1999 topics, 1999 annotations, and 1999 systems (a sketch of the annotation loop follows)
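A minimal sketch of how such a search-guided annotation loop could work, assuming generic search() and judge() stand-ins for a ranked-retrieval engine and a human annotator. All names, the FB=10 feedback depth as a loop parameter, and the stopping rules are illustrative assumptions, not the actual NIST simulation harness:

```python
# Sketch of a search-guided annotation loop for one topic.

def search_guided_annotation(seed_query, search, judge,
                             feedback_depth=10, max_reads=1000):
    """Annotate one topic by alternating search and relevance feedback.

    search(query) -> list of story ids, best match first
    judge(story)  -> True if the annotator judges the story on-topic
    """
    on_topic, read = set(), set()
    query = seed_query
    while len(read) < max_reads:
        # Read only the top few unread hits (the feedback depth).
        hits = [s for s in search(query) if s not in read][:feedback_depth]
        if not hits:
            break  # search exhausted
        found = 0
        for story in hits:
            read.add(story)
            if judge(story):
                on_topic.add(story)
                found += 1
        if found == 0:
            break  # no new on-topic stories; the annotator stops
        # Feed the on-topic stories back to expand the query.
        query = (seed_query, frozenset(on_topic))
    # ROF, as in the chart below: on-topic reads per off-topic read.
    rof = len(on_topic) / max(1, len(read) - len(on_topic))
    return on_topic, read, rof
```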

[Chart: Effectiveness of Search-Guided Annotation. X axis: ROF (#Read On-topic / #Read Off-Topic), 0.1–10; Y axis: 0.001–1000, log scale. Curves: SG (FB=10) Pjudged, the probability of a human reading a judged story, and SG (FB=10) #Reads, the number of read stories. Annotation: stability in the "region of interest"]

TDT Topic Tracking Task

[Diagram: a source stream of training data (on-topic stories) followed by test data (unknown stories)]

7 Participants: Dragon, IBM, Texas A&M Univ., TNO, Univ. of Iowa, Univ. of Massachusetts, Univ. of Maryland

System Goal:
– To detect stories that discuss the target topic, in multiple source streams.

• Supervised Training
– Given Nt sample stories that discuss a given target topic
• Testing
– Find all subsequent stories that discuss the target topic
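To make the task concrete, here is a minimal tracking sketch: a bag-of-words centroid built from the Nt training stories, with a cosine-similarity threshold applied to each test story. The representation and the threshold value are illustrative assumptions, not any participant's actual system.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def track(training_stories, test_stream, threshold=0.2):
    """Given Nt on-topic sample stories, decide YES/NO for each
    subsequent story in the test stream."""
    centroid = Counter()
    for story in training_stories:   # supervised training on Nt samples
        centroid.update(vectorize(story))
    return [(story, cosine(vectorize(story), centroid) >= threshold)
            for story in test_stream]
```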

Topic Tracking Results

[Bar charts: normalized tracking cost per system, comparing English translation (Nt=4, Nn=0) with native orthography (Nt=4, Nn=2 and Nt=4, Nn=0), with and without negative training examples]

Basic Condition: Newswire + BNews, reference story boundaries, English training: 1 on-topic story

Challenge Condition: Newswire + BNews ASR, automatic story boundaries, English training: 4 on-topic and 2 negative training stories
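For orientation, the "normalized tracking cost" plotted throughout these slides follows the standard TDT detection cost: a weighted combination of miss and false-alarm probabilities, normalized by the cost of the better trivial (always-NO / always-YES) system. The formulas and the customary parameter values (C_miss = 1, C_FA = 0.1, P_topic = 0.02) are quoted from the TDT evaluation plan as background, not from this slide deck:

```latex
C_{det} = C_{miss} \cdot P_{miss} \cdot P_{topic}
        + C_{FA} \cdot P_{FA} \cdot (1 - P_{topic})

C_{norm} = \frac{C_{det}}
                {\min\!\left(C_{miss} \cdot P_{topic},\;
                             C_{FA} \cdot (1 - P_{topic})\right)}
```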

Topic Tracking Results (Expanded Basic Condition DET Curve)

Topic Tracking Results (Expanded Challenge Condition DET Curve)

[DET curves comparing system performance with and without negative training examples]

Effect of Automatic Story Boundaries

• Evaluation conditioned jointly by source and language
– Newswire, Broadcast News, English and Mandarin

[Bar chart: degradation due to story boundary source for ASR. Test condition: NWT+BNasr, 4 English training stories, reference boundaries. Normalized tracking cost: IBM1 = 0.093, UMass1 = 0.119]

[Scatter plot: normalized tracking cost with 2000 topic training vs. 1999 topic training, log scales 0.01–1]

Variability of Tracking Performance Based on Training Stories

• BBN ran their 1999 system on this year's index files:
– Same topics, but different training stories
– One caveat: these results are based on different "test epochs"; the 2000 index files contain more stories

• There could be several reasons for the difference
• …this needs future investigation


TDT Link Detection Task

One Participant: University of Massachusetts

System Goal:
– To detect whether a pair of stories discuss the same topic. (Can be thought of as a "primitive operator" from which to build a variety of applications)
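Since the task reduces to a YES/NO same-topic decision on a story pair, a sketch needs only a story-pair similarity score and a threshold. The scorer is pluggable (e.g., the bag-of-words cosine from the tracking sketch above), and the threshold value is an arbitrary illustration:

```python
def link_decision(story_a, story_b, similarity, threshold=0.25):
    """Primitive operator: do two stories discuss the same topic?

    similarity(a, b) -> score, higher meaning more alike; any story-pair
    scorer (e.g., the bag-of-words cosine sketched earlier) plugs in.
    Returns the hard YES/NO decision and the score behind it.
    """
    score = similarity(story_a, story_b)
    return score >= threshold, score
```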

2000 Link Detection Results

• A lot was learned last year:
– The test set must be properly sampled (see the sampling sketch below)
• "Linked" story pairs were selected by randomly sampling all possible on-topic story pairs
• "Unlinked" pairs were selected using an on-topic story as one member of the pair and a randomly chosen story as the second
– This year, the task was made multilingual
– More story pairs were used
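A sketch of that sampling protocol as described above; the data structures and the handling of accidental same-topic "unlinked" pairs are assumptions for illustration:

```python
import random
from itertools import combinations

def build_link_test_set(topics, all_stories, n_linked, n_unlinked, seed=0):
    """Sample linked and unlinked story pairs.

    topics:      dict of topic id -> list of on-topic story ids
    all_stories: list of every story id in the corpus
    """
    rng = random.Random(seed)
    # Linked pairs: random sample over all possible on-topic pairs.
    on_topic_pairs = [pair for stories in topics.values()
                      for pair in combinations(stories, 2)]
    linked = rng.sample(on_topic_pairs, min(n_linked, len(on_topic_pairs)))
    # Unlinked pairs: an on-topic story plus a randomly chosen story.
    # (In practice, pairs that happen to share a topic would be filtered.)
    on_topic = [s for stories in topics.values() for s in stories]
    unlinked = []
    while len(unlinked) < n_unlinked:
        first, second = rng.choice(on_topic), rng.choice(all_stories)
        if first != second:
            unlinked.append((first, second))
    return linked, unlinked
```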

[Bar chart: Link Detection Test Set Composition. Number of story pairs (0–70,000), linked vs. unlinked, for Overall, Eng-to-Eng, Eng-to-Man, and Man-to-Man]

Link Detection Results

• Required Condition
– Multilingual texts
– Newswire + Broadcast News ASR
– Reference story boundaries
– 10-file decision deferral

[Bar chart: UMass1 normalized link detection cost. Overall: 0.31, Eng-to-Eng: 0.17, Man-to-Man: 0.33, Eng-to-Man: 0.41]

TDT Topic Detection Task

Three Participants: Chinese Univ. of Hong Kong, Dragon, Univ. of Massachusetts

System Goal:
– To detect topics in terms of the (clusters of) stories that discuss them.

• "Unsupervised" topic training
• New topics must be detected as the incoming stories are processed (a clustering sketch follows below).
• Input stories are then associated with one of the topics.
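A minimal single-pass online clustering sketch of this process: each incoming story either joins its best-matching existing topic cluster or, failing the threshold, is detected as a new topic. The threshold rule and Counter-based centroids are illustrative assumptions, not any participant's system.

```python
def detect_topics(story_stream, vectorize, similarity, new_topic_threshold=0.3):
    """Single-pass online clustering: detect new topics as stories arrive,
    then associate each story with one topic.

    vectorize(story) -> Counter-like term vector
    similarity(v, c) -> score between a story vector and a centroid
    """
    clusters = []  # each cluster: {"centroid": Counter, "stories": [...]}
    for story in story_stream:
        v = vectorize(story)
        best, best_score = None, 0.0
        for cluster in clusters:
            score = similarity(v, cluster["centroid"])
            if score > best_score:
                best, best_score = cluster, score
        if best is None or best_score < new_topic_threshold:
            # Not close enough to anything seen so far: a new topic!
            clusters.append({"centroid": v.copy(), "stories": [story]})
        else:
            best["stories"].append(story)
            best["centroid"].update(v)  # fold the story into the centroid
    return clusters
```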

• Required Condition
– Multilingual Topic Detection
– Newswire + Broadcast News ASR
– Automatic Story Boundaries

• Performance on the 1999 and 2000 topic sets is different

2000 Topic Detection Evaluation

[Bar charts: normalized detection cost for CUHK1, Dragon1, and UMass1 on the 1999 topics, 2000 topics, and combined 1999+2000 topics; one panel using English translations for Mandarin, one using native orthography]

Effect of Topic Size on Detection Performance

• The 1999 topics have more on-topic stories than the 2000 topics
• The distribution of scores is related to topic size
– Bigger topics tend to have higher scores.
– Is this a behavior induced by setting a topic-size parameter in training?

[Scatter plot: normalized detection cost vs. number of on-topic stories (1–1000), 1999 topics vs. 2000 topics. Dragon1 results, NWT+BNasr, reference boundaries, multilingual texts]

Fractional Components of Detection Cost

• Evaluations conditioned on factors (like language) are problematic

• Instead, compute the additive contributions to detection costs for different subsets of data.
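One way to read "additive contributions": keep the global error denominators and split only the error counts by data subset, so that the per-subset costs sum exactly to the overall cost. A sketch in the notation of the standard TDT cost; this decomposition is my reading of the slide, not a quoted formula:

```latex
C_{det}^{(s)} = C_{miss} \cdot \frac{N_{miss}^{(s)}}{N_{target}} \cdot P_{topic}
              + C_{FA} \cdot \frac{N_{FA}^{(s)}}{N_{nontarget}} \cdot (1 - P_{topic})

\sum_{s} C_{det}^{(s)} = C_{det}
\quad \text{(where } s \text{ ranges over subsets, e.g. English and Mandarin stories)}
```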

[Charts: fractional components of normalized detection cost, split into English and Mandarin contributions; scatter plot of fractional cost of Mandarin errors vs. fractional cost of English errors (log scales) for the 1999 and 2000 topics, showing an interesting reversal between the topic sets. Dragon1 results, NWT+BNasr, reference boundaries, multilingual texts]

Effects of Automatic Boundaries on Detection Performance

– Multilingual Topic Detection
– Newswire + Broadcast News ASR
– Reference vs. Automatic Story Boundaries

[Bar chart: normalized detection cost for Dragon1, Dragon2, and UMass1 with reference vs. automatic story boundaries. Automatic boundaries yield a 19%, 21%, and 41% relative increase in cost, respectively]

TDT Segmentation Task

[Diagram: a transcription, a stream of text (words), divided into story and non-story regions]

One Participant: MITRE
(For TDT 2000, story segmentation is an integral part of the other tasks, not just a separate evaluation task)

System Goal:
– To segment the source stream into its constituent stories, for all audio sources.
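As a rough illustration of what such a segmenter must do, here is a minimal lexical-cohesion sketch in the spirit of TextTiling (not MITRE's method; the window size and threshold are arbitrary): hypothesize a story boundary wherever the vocabulary overlap between adjacent word windows dips.

```python
def overlap(left, right):
    """Lexical cohesion between two word windows (Jaccard overlap)."""
    a, b = set(left), set(right)
    return len(a & b) / max(len(a | b), 1)

def segment(words, window=100, threshold=0.10):
    """Return word offsets of hypothesized story boundaries in a
    transcript, given as a flat list of words."""
    boundaries = []
    for i in range(window, len(words) - window, window):
        if overlap(words[i - window:i], words[i:i + window]) < threshold:
            boundaries.append(i)  # cohesion dip => likely story change
    return boundaries
```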

Story Segmentation Results

• Required Condition:
– Broadcast News ASR

[Bar chart: MITRE Segmentation Results. Normalized segmentation cost (log scale, 0.1–1.0) for the 1999 and 2000 systems, on English (10KW deferral) and Mandarin (15KC deferral)]

TDT First Story Detection (FSD) Task

Two Participants: National Taiwan University and University of Massachusetts

System Goal:
– To detect the first story that discusses a topic, for all topics.

• Evaluating "part" of a Topic Detection system, i.e., deciding when to start a new cluster (a sketch follows the diagram below)

[Diagram: a story stream covering two topics; the first story on each topic is marked as a first story, and subsequent stories on those topics are not]
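As the slide notes, first story detection is the "start a new cluster" decision in isolation. A minimal novelty-detection sketch (representation, scorer, and threshold are illustrative assumptions): flag a story as a first story when it is insufficiently similar to every story seen before it.

```python
def first_story_detection(story_stream, vectorize, similarity,
                          novelty_threshold=0.2):
    """For each story, decide whether it is the first story on its topic.

    vectorize(story)  -> term vector
    similarity(v, w)  -> score between two story vectors
    """
    seen, decisions = [], []
    for story in story_stream:
        v = vectorize(story)
        closest = max((similarity(v, w) for w in seen), default=0.0)
        decisions.append((story, closest < novelty_threshold))  # True = first
        seen.append(v)
    return decisions
```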

UMass Historical Performance Using Reference Story Boundaries

[Bar chart: UMass normalized FSD cost. 1999 system: 0.76, 2000 system: 0.64]

First Story Detection Results

• Required Condition:
– English Newswire and Broadcast News ASR transcripts
– Automatic story boundaries
– One-file decision deferral

[Bar chart: normalized FSD cost (log scale, 0.1–10) for NTU1 and UMass1 under three conditions: NWT+BNasr with reference boundaries, NWT+BNasr with automatic boundaries (the required condition), and NWT+BNman with reference boundaries]

1999 and 2000 Topic Set Differences in FSD Evaluation

For UMass there is a slight difference, but a marked difference for the NTU system.

Summary

• Many, many things remaining to look at
– Results appear to be a function of topic size and topic set in the detection task, but it's unclear why.
• The re-usability of last year's detection system outputs enables valuable studies
• Conditioned detection evaluation should be replaced with a "contribution to cost" model
– Performance variability on tracking training stories should be further investigated
– …and the list goes on

• When should the annotations be released?
• Need to find a cost-effective annotation technique
– Consider TREC ad-hoc style annotation via simulation
