Results of the 2000 Topic Detection and Tracking Evaluation
in Mandarin and English
Jonathan Fiscus and George Doddington
What’s New in TDT 2000
• TDT3 corpus used in both 1999 and 2000
  – 120 topics used in the 2000 test: 60 1999 topics, 60 new topics
• Of 44K news stories, 24% were at least singly judged YES or BRIEF
• 1999 and 2000 topics are very different in terms of size and cross-language makeup
  – Annotation of new topics used search-engine-guided annotation: "Use a search engine with on-topic story feedback and interactive searching techniques to limit the number of stories read by annotators"
• Evaluation Protocol Changes
  – Only minor changes to Tracking (negative example stories)
  – Link Detection test set selection changed in light of last year's experience
[Figure: count of on-topic stories (Mandarin vs. English) for the full TDT3 topic set, the 1999 topic set, and the 2000 topic set]
Search-Guided Annotation: How will it affect scores?
• Simulate search-guided annotation using 1999 topics, 1999 annotations, and 1999 systems
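The search-guided loop described above can be sketched as follows. This is a minimal illustration, not the actual annotation tool: `overlap` stands in for a real search engine's relevance scoring, `is_on_topic` stands in for the human annotator's judgment, and the consecutive-miss stopping rule is a hypothetical choice for the sketch.

```python
def overlap(a, b):
    """Crude relevance score: number of shared words (stand-in for a search engine)."""
    return len(set(a.split()) & set(b.split()))

def search_guided_annotation(stories, query, is_on_topic, feedback_size=10, patience=5):
    """Rank unread stories against the query plus confirmed on-topic stories,
    read from the top, and stop after `patience` consecutive off-topic reads
    (a hypothetical stopping rule for this sketch)."""
    feedback = [query]                       # seed query; grows with on-topic feedback
    judged_on, reads, misses = [], 0, 0
    unread = list(range(len(stories)))
    while unread and misses < patience:
        # Re-rank the unread stories against the current feedback set.
        best = max(unread, key=lambda i: max(overlap(stories[i], f) for f in feedback))
        unread.remove(best)
        reads += 1
        if is_on_topic(stories[best]):       # the human annotator's judgment
            judged_on.append(best)
            misses = 0
            feedback = [query] + [stories[j] for j in judged_on[-feedback_size:]]
        else:
            misses += 1
    return judged_on, reads
```

The point of the simulation is exactly the trade-off plotted on the next slide: how many stories the annotator reads versus how many judged stories a human actually sees.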
[Figure: "Effectiveness of Search-Guided Annotation" — log-log plot of SG (FB=10) Pjudged (probability of a human reading a judged story) and SG (FB=10) #Reads (number of read stories) against ROF (#Read On-topic / #Read Off-topic), showing stability in the "region of interest"]
TDT Topic Tracking Task
[Figure: a story stream split into training data (labeled on-topic) and test data (unknown)]
7 Participants:
– Dragon, IBM, Texas A&M Univ., TNO, Univ. of Iowa, Univ. of Massachusetts, Univ. of Maryland
System Goal:
– To detect stories that discuss the target topic, in multiple source streams.
• Supervised Training
  – Given Nt sample stories that discuss a given target topic
• Testing
  – Find all subsequent stories that discuss the target topic
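Tracking systems are scored with the TDT normalized detection cost, a weighted sum of the miss and false-alarm rates. A minimal sketch, assuming the standard TDT parameter settings (C_miss = 1, C_fa = 0.1, P_topic = 0.02):

```python
def normalized_tracking_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_topic=0.02):
    """Weighted sum of miss and false-alarm rates, normalized so that a
    trivial system (always YES or always NO) scores no better than 1.0."""
    cost = c_miss * p_miss * p_topic + c_fa * p_fa * (1.0 - p_topic)
    return cost / min(c_miss * p_topic, c_fa * (1.0 - p_topic))
```

The normalization is what lets the charts below compare sites on a common scale: a cost near 1 means the system is no better than always answering NO.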
Topic Tracking Results
[Figure: normalized tracking cost per site — one panel for English translation and one for native orthography, with results for English translation (Nt=4, Nn=0), native orthography (Nt=4, Nn=2), and native orthography (Nt=4, Nn=0), with and without negative training]
• Basic Condition: Newswire + BNews, reference story boundaries, English training: 1 on-topic story
• Challenge Condition: Newswire + BNews ASR, automatic story boundaries, English training: 4 on-topic stories, 2 negative training stories
Topic Tracking Results (Expanded Basic Condition DET Curve)

Topic Tracking Results (Expanded Challenge Condition DET Curve)
[Figure: DET curves with and without negative training]
Effect of Automatic Story Boundaries
• Evaluation conditioned jointly by source and language
  – Newswire, Broadcast News, English and Mandarin
[Figure: degradation in tracking cost due to automatic story boundaries, by ASR source]
Test Condition: NWT+Bnasr, 4 English Training Stories, Reference Boundaries
IBM1: 0.093, UMass1: 0.119
[Figure: normalized tracking cost per site with 1999 vs. 2000 topic training, and a scatter plot of 2000 topic training cost against 1999 topic training cost]
Variability of Tracking Performance Based on Training Stories
• BBN ran their 1999 system on this year's index files:
  – Same topics, but different training stories
  – One caveat: these results are based on different "test epochs"; the 2000 index files contain more stories
• There could be several reasons for the difference
• …needs future investigation
NIST Speech Group
TDT Link Detection Task
One Participant: University of Massachusetts
System Goal:
– To detect whether a pair of stories discuss the same topic. (Can be thought of as a "primitive operator" to build a variety of applications)
2000 Link Detection Results
• A lot was learned last year:
  – The test set must be properly sampled
    • "Linked" story pairs were selected by randomly sampling all possible on-topic story pairs
    • "Unlinked" pairs were selected using an on-topic story as one member of the pair and a randomly chosen story as the second
  – This year, the task was made multilingual
  – More story pairs were used
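The sampling scheme above can be sketched directly. This is an illustration of the described procedure, not NIST's actual selection code; the function and argument names are made up for the sketch.

```python
import random
from itertools import combinations

def sample_link_pairs(on_topic, all_stories, n_linked, n_unlinked, seed=0):
    """`on_topic` maps topic -> list of story ids. Linked pairs are sampled at
    random from all possible same-topic pairs; unlinked pairs keep an on-topic
    story as one member and draw the other at random from the corpus."""
    rng = random.Random(seed)
    same_topic = [p for stories in on_topic.values() for p in combinations(stories, 2)]
    linked = rng.sample(same_topic, min(n_linked, len(same_topic)))
    topic_of = {s: t for t, stories in on_topic.items() for s in stories}
    unlinked = []
    while len(unlinked) < n_unlinked:
        a = rng.choice(sorted(topic_of))          # one member is always an on-topic story
        b = rng.choice(all_stories)
        if b != a and topic_of.get(b) != topic_of[a]:
            unlinked.append((a, b))               # reject pairs that are actually linked
    return linked, unlinked
```

Sampling over *all* possible on-topic pairs, rather than over stories, is what last year's experience showed to matter: it keeps the linked set from being dominated by a few large topics.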
[Figure: "Link Detection Test Set Composition" — number of linked and unlinked story pairs, overall and for Eng-to-Eng, Eng-to-Man, and Man-to-Man]
Link Detection Results
• Required Condition
  – Multilingual texts
  – Newswire + Broadcast News ASR
  – Reference story boundaries
  – 10 file decision deferral
[Figure: UMass1 normalized link detection cost — Overall 0.31, Eng-to-Eng 0.17, Man-to-Man 0.33, Eng-to-Man 0.41]
TDT Topic Detection Task
Three Participants: Chinese Univ. of Hong Kong, Dragon, Univ. of Massachusetts
System Goal:
– To detect topics in terms of the (clusters of) stories that discuss them.
• “Unsupervised” topic training
• New topics must be detected as the incoming stories are processed.
• Input stories are then associated with one of the topics.
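The two bullets above describe single-pass incremental clustering: each incoming story either joins an existing topic cluster or opens a new one. A minimal sketch, assuming a generic similarity function and threshold (the participants' actual scoring models differ):

```python
def detect_topics(stories, similarity, threshold=0.2):
    """Single-pass incremental clustering: each incoming story joins its most
    similar existing cluster, or starts a new topic when no cluster scores
    at or above `threshold` (both are illustrative stand-ins here)."""
    clusters = []                                     # each cluster: list of story indices
    for i, story in enumerate(stories):
        scores = [max(similarity(story, stories[j]) for j in c) for c in clusters]
        if scores and max(scores) >= threshold:
            clusters[scores.index(max(scores))].append(i)
        else:
            clusters.append([i])                      # a new topic is detected
    return clusters

def jaccard(a, b):
    """Toy word-overlap similarity for the sketch."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)
```

The cluster-creation branch is the part shared with first story detection, evaluated separately below.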
• Required Condition
  – Multilingual Topic Detection
  – Newswire + Broadcast News ASR
  – Automatic Story Boundaries
• Performance on the 1999 and 2000 topic sets is different
2000 Topic Detection Evaluation
[Figure: normalized detection cost for CUHK1, Dragon1, and UMass1 on the 1999, 2000, and combined topic sets — one panel using English translations for Mandarin, one using native orthography]
Effect of Topic Size on Detection Performance
• The 1999 topics have more on-topic stories than the 2000 topics
• The distribution of scores is related to topic size
  – Bigger topics tend to have higher scores.
  – Is this a behavior induced by setting a topic-size parameter in training?
[Figure: Dragon1 normalized detection cost vs. number of on-topic stories for the 1999 and 2000 topics — NWT+BNasr, reference boundaries, multilingual texts]
Fractional Components of Detection Cost
• Evaluations conditioned on factors (like language) are problematic
• Instead, compute the additive contributions to detection costs for different subsets of data.
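One way to realize this "contribution to cost" idea is to keep a single scoring run and split the detection cost into additive per-subset terms. A sketch, assuming the standard TDT cost parameters; the function shape and `errors` layout are illustrative, not the official scoring tool:

```python
def fractional_detection_costs(errors, n_targets, n_nontargets,
                               c_miss=1.0, c_fa=0.1, p_topic=0.02):
    """`errors` maps subset -> (n_misses, n_false_alarms). Each subset gets
    the additive share of the normalized detection cost that its own errors
    contribute; the shares sum exactly to the overall normalized cost."""
    norm = min(c_miss * p_topic, c_fa * (1.0 - p_topic))
    return {subset: (c_miss * (n_miss / n_targets) * p_topic
                     + c_fa * (n_fa / n_nontargets) * (1.0 - p_topic)) / norm
            for subset, (n_miss, n_fa) in errors.items()}
```

Because the cost is linear in the error counts, partitioning the errors by language (or any other factor) decomposes the overall cost without re-scoring each subset as a separate evaluation.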
[Figure: Dragon1 fractional components of normalized detection cost for English and Mandarin, and a scatter plot of the fractional cost of Mandarin errors against the fractional cost of English errors for the 1999 and 2000 topics (NWT+BNasr, reference boundaries, multilingual texts), showing an interesting reversal between topic sets]
Effects of Automatic Boundaries on Detection Performance
– Multilingual Topic Detection
– Newswire + Broadcast News ASR
– Reference vs. Automatic Story Boundaries
[Figure: normalized detection cost for Dragon1, Dragon2, and UMass1 with reference vs. automatic boundaries — a 19%, 21%, and 41% relative increase in cost, respectively]
TDT Segmentation Task
[Figure: a transcription stream of text (words) segmented into story and non-story regions]
One Participant: MITRE
(For TDT 2000, story segmentation is an integral part of the other tasks, not just a separate evaluation task)
System Goal:
– To segment the source stream into its constituent stories, for all audio sources.
Story Segmentation Results
• Required Condition:
  – Broadcast News ASR
MITRE Segmentation Results
[Figure: normalized segmentation cost for the MITRE 1999 and 2000 systems — English with 10KW deferral, Mandarin with 15KC deferral]
TDT First Story Detection (FSD) Task
Two Participants: National Taiwan University and University of Massachusetts
System Goal:
– To detect the first story that discusses a topic, for all topics.
• Evaluating "part" of a Topic Detection system (i.e., when to start a new cluster)
[Figure: a stream of stories on two topics; the first story of each topic is marked as a first story, all later stories are not]
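Viewed as the cluster-creation decision, FSD reduces to a novelty test: flag a story when nothing seen before it is similar enough. A minimal sketch, with an illustrative similarity function and threshold (real systems use richer models and must also respect the decision deferral):

```python
def first_story_detection(stories, similarity, threshold=0.2):
    """Flag a story as FIRST when no earlier story is similar enough —
    exactly the moment a detection system would open a new cluster."""
    return [all(similarity(story, prev) < threshold for prev in stories[:i])
            for i, story in enumerate(stories)]

def jaccard(a, b):
    """Toy word-overlap similarity for the sketch."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)
```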
UMass Historical Performance Using Reference Story Boundaries
[Figure: UMass normalized FSD cost — 1999 system 0.76, 2000 system 0.64]
First Story Detection Results
• Required Condition:
  – English Newswire and Broadcast News ASR transcripts
  – Automatic story boundaries
  – One file decision deferral
[Figure: normalized FSD cost for NTU1 and UMass1 under three conditions — NWT+BNasr with reference boundaries, NWT+BNasr with automatic boundaries (the required condition), and NWT+BNman with reference boundaries]
1999 and 2000 Topic Set Differences in FSD Evaluation
For UMass there is a slight difference, but a marked difference for the NTU system
Summary
• Many, many things remain to look at
  – Results appear to be a function of topic size and topic set in the detection task, but it's unclear why.
    • The re-usability of last year's detection system outputs enables valuable studies
    • Conditioned detection evaluation should be replaced with a "contribution to cost" model
  – Performance variability on tracking training stories should be further investigated
  – …and the list goes on
• When should the annotations be released?
• Need to find cost-effective annotation techniques
  – Consider TREC ad-hoc style annotation via simulation