Results of the 2000 Topic Detection and Tracking Evaluation
in Mandarin and English
Jonathan Fiscus and George Doddington
What’s New in TDT 2000
• TDT3 corpus used in both 1999 and 2000
  – 120 topics used in the 2000 test: 60 1999 topics, 60 new topics
• Of 44K news stories, 24% were at least singly judged YES or BRIEF
• 1999 and 2000 topics are very different in terms of size and cross-language makeup
  – Annotation of new topics used search-engine-guided annotation: "Use a search engine with on-topic story feedback and interactive searching techniques to limit the number of stories read by annotators"
• Evaluation Protocol Changes
  – Only minor changes to Tracking (negative example stories)
  – Link Detection test set selection changed in light of last year's experience
[Figure: count of on-topic stories (Mandarin vs. English) for the full TDT3 topic set, the 1999 topic set, and the 2000 topic set]
Search-Guided Annotation: How will it affect scores?
• Simulate search-guided annotation using 1999 topics, 1999 annotations, and 1999 systems
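The search-guided loop described above can be sketched as follows. This is a minimal illustration, not the actual annotation tool: `overlap` stands in for a real search engine's relevance scoring, `is_on_topic` stands in for the human annotator's judgment, and the consecutive-miss stopping rule is a hypothetical choice for the sketch.

```python
def overlap(a, b):
    """Crude relevance score: number of shared words (stand-in for a search engine)."""
    return len(set(a.split()) & set(b.split()))

def search_guided_annotation(stories, query, is_on_topic, feedback_size=10, patience=5):
    """Rank unread stories against the query plus confirmed on-topic stories,
    read from the top, and stop after `patience` consecutive off-topic reads
    (a hypothetical stopping rule for this sketch)."""
    feedback = [query]                       # seed query; grows with on-topic feedback
    judged_on, reads, misses = [], 0, 0
    unread = list(range(len(stories)))
    while unread and misses < patience:
        # Re-rank the unread stories against the current feedback set.
        best = max(unread, key=lambda i: max(overlap(stories[i], f) for f in feedback))
        unread.remove(best)
        reads += 1
        if is_on_topic(stories[best]):       # the human annotator's judgment
            judged_on.append(best)
            misses = 0
            feedback = [query] + [stories[j] for j in judged_on[-feedback_size:]]
        else:
            misses += 1
    return judged_on, reads
```

The point of the simulation is exactly the trade-off plotted on the next slide: how many stories the annotator reads versus how many judged stories a human actually sees.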
[Figure: "Effectiveness of Search-Guided Annotation" — log-log plot of SG (FB=10) Pjudged (probability of a human reading a judged story) and SG (FB=10) #Reads (number of read stories) against ROF (#Read On-topic / #Read Off-topic), showing stability in the "region of interest"]
TDT Topic Tracking Task
[Figure: a story stream split into training data (labeled on-topic) and test data (unknown)]
7 Participants:
– Dragon, IBM, Texas A&M Univ., TNO, Univ. of Iowa, Univ. of Massachusetts, Univ. of Maryland
System Goal:
– To detect stories that discuss the target topic, in multiple source streams.
• Supervised Training
  – Given Nt sample stories that discuss a given target topic
• Testing
  – Find all subsequent stories that discuss the target topic
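Tracking systems are scored with the TDT normalized detection cost, a weighted sum of the miss and false-alarm rates. A minimal sketch, assuming the standard TDT parameter settings (C_miss = 1, C_fa = 0.1, P_topic = 0.02):

```python
def normalized_tracking_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_topic=0.02):
    """Weighted sum of miss and false-alarm rates, normalized so that a
    trivial system (always YES or always NO) scores no better than 1.0."""
    cost = c_miss * p_miss * p_topic + c_fa * p_fa * (1.0 - p_topic)
    return cost / min(c_miss * p_topic, c_fa * (1.0 - p_topic))
```

The normalization is what lets the charts below compare sites on a common scale: a cost near 1 means the system is no better than always answering NO.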
Topic Tracking Results
[Figure: normalized tracking cost per site — one panel for English translation and one for native orthography, with results for English translation (Nt=4, Nn=0), native orthography (Nt=4, Nn=2), and native orthography (Nt=4, Nn=0), with and without negative training]
• Basic Condition: Newswire + BNews, reference story boundaries, English training: 1 on-topic story
• Challenge Condition: Newswire + BNews ASR, automatic story boundaries, English training: 4 on-topic stories, 2 negative training stories
Topic Tracking Results (Expanded Basic Condition DET Curve)

Topic Tracking Results (Expanded Challenge Condition DET Curve)
[Figure: DET curves with and without negative training]
Effect of Automatic Story Boundaries
• Evaluation conditioned jointly by source and language
  – Newswire, Broadcast News, English and Mandarin
[Figure: degradation in tracking cost due to automatic story boundaries, by ASR source]
Test Condition: NWT+Bnasr, 4 English Training Stories, Reference Boundaries
IBM1: 0.093, UMass1: 0.119
[Figure: normalized tracking cost per site with 1999 vs. 2000 topic training, and a scatter plot of 2000 topic training cost against 1999 topic training cost]
Variability of Tracking Performance Based on Training Stories
• BBN ran their 1999 system on this year's index files:
  – Same topics, but different training stories
  – One caveat: these results are based on different "test epochs"; the 2000 index files contain more stories
• There could be several reasons for the difference
• …needs future investigation
NIST Speech Group
TDT Link Detection Task
One Participant: University of Massachusetts
System Goal:
– To detect whether a pair of stories discuss the same topic. (Can be thought of as a "primitive operator" to build a variety of applications)
2000 Link Detection Results
• A lot was learned last year:
  – The test set must be properly sampled
    • "Linked" story pairs were selected by randomly sampling all possible on-topic story pairs
    • "Unlinked" pairs were selected using an on-topic story as one member of the pair and a randomly chosen story as the second
  – This year, the task was made multilingual
  – More story pairs were used
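The sampling scheme above can be sketched directly. This is an illustration of the described procedure, not NIST's actual selection code; the function and argument names are made up for the sketch.

```python
import random
from itertools import combinations

def sample_link_pairs(on_topic, all_stories, n_linked, n_unlinked, seed=0):
    """`on_topic` maps topic -> list of story ids. Linked pairs are sampled at
    random from all possible same-topic pairs; unlinked pairs keep an on-topic
    story as one member and draw the other at random from the corpus."""
    rng = random.Random(seed)
    same_topic = [p for stories in on_topic.values() for p in combinations(stories, 2)]
    linked = rng.sample(same_topic, min(n_linked, len(same_topic)))
    topic_of = {s: t for t, stories in on_topic.items() for s in stories}
    unlinked = []
    while len(unlinked) < n_unlinked:
        a = rng.choice(sorted(topic_of))          # one member is always an on-topic story
        b = rng.choice(all_stories)
        if b != a and topic_of.get(b) != topic_of[a]:
            unlinked.append((a, b))               # reject pairs that are actually linked
    return linked, unlinked
```

Sampling over *all* possible on-topic pairs, rather than over stories, is what last year's experience showed to matter: it keeps the linked set from being dominated by a few large topics.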
[Figure: "Link Detection Test Set Composition" — number of linked and unlinked story pairs, overall and for Eng-to-Eng, Eng-to-Man, and Man-to-Man]
Link Detection Results
• Required Condition
  – Multilingual texts
  – Newswire + Broadcast News ASR
  – Reference story boundaries
  – 10 file decision deferral
[Figure: UMass1 normalized link detection cost — Overall 0.31, Eng-to-Eng 0.17, Man-to-Man 0.33, Eng-to-Man 0.41]
TDT Topic Detection Task
Three Participants: Chinese Univ. of Hong Kong, Dragon, Univ. of Massachusetts
System Goal:
– To detect topics in terms of the (clusters of) stories that discuss them.
• “Unsupervised” topic training
• New topics must be detected as the incoming stories are processed.
• Input stories are then associated with one of the topics.
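The two bullets above describe single-pass incremental clustering: each incoming story either joins an existing topic cluster or opens a new one. A minimal sketch, assuming a generic similarity function and threshold (the participants' actual scoring models differ):

```python
def detect_topics(stories, similarity, threshold=0.2):
    """Single-pass incremental clustering: each incoming story joins its most
    similar existing cluster, or starts a new topic when no cluster scores
    at or above `threshold` (both are illustrative stand-ins here)."""
    clusters = []                                     # each cluster: list of story indices
    for i, story in enumerate(stories):
        scores = [max(similarity(story, stories[j]) for j in c) for c in clusters]
        if scores and max(scores) >= threshold:
            clusters[scores.index(max(scores))].append(i)
        else:
            clusters.append([i])                      # a new topic is detected
    return clusters

def jaccard(a, b):
    """Toy word-overlap similarity for the sketch."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)
```

The cluster-creation branch is the part shared with first story detection, evaluated separately below.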
• Required Condition
  – Multilingual Topic Detection
  – Newswire + Broadcast News ASR
  – Automatic Story Boundaries
• Performance on the 1999 and 2000 topic sets is different
2000 Topic Detection Evaluation
[Figure: normalized detection cost for CUHK1, Dragon1, and UMass1 on the 1999, 2000, and combined topic sets — one panel using English translations for Mandarin, one using native orthography]
Effect of Topic Size on Detection Performance
• The 1999 topics have more on-topic stories than the 2000 topics
• The distribution of scores is related to topic size
  – Bigger topics tend to have higher scores.
  – Is this a behavior induced by setting a topic-size parameter in training?
[Figure: Dragon1 normalized detection cost vs. number of on-topic stories for the 1999 and 2000 topics — NWT+BNasr, reference boundaries, multilingual texts]
Fractional Components of Detection Cost
• Evaluations conditioned on factors (like language) are problematic
• Instead, compute the additive contributions to detection costs for different subsets of data.
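One way to realize this "contribution to cost" idea is to keep a single scoring run and split the detection cost into additive per-subset terms. A sketch, assuming the standard TDT cost parameters; the function shape and `errors` layout are illustrative, not the official scoring tool:

```python
def fractional_detection_costs(errors, n_targets, n_nontargets,
                               c_miss=1.0, c_fa=0.1, p_topic=0.02):
    """`errors` maps subset -> (n_misses, n_false_alarms). Each subset gets
    the additive share of the normalized detection cost that its own errors
    contribute; the shares sum exactly to the overall normalized cost."""
    norm = min(c_miss * p_topic, c_fa * (1.0 - p_topic))
    return {subset: (c_miss * (n_miss / n_targets) * p_topic
                     + c_fa * (n_fa / n_nontargets) * (1.0 - p_topic)) / norm
            for subset, (n_miss, n_fa) in errors.items()}
```

Because the cost is linear in the error counts, partitioning the errors by language (or any other factor) decomposes the overall cost without re-scoring each subset as a separate evaluation.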
[Figure: Dragon1 fractional components of normalized detection cost for English and Mandarin, and a scatter plot of the fractional cost of Mandarin errors against the fractional cost of English errors for the 1999 and 2000 topics (NWT+BNasr, reference boundaries, multilingual texts), showing an interesting reversal between topic sets]
Effects of Automatic Boundaries on Detection Performance
– Multilingual Topic Detection
– Newswire + Broadcast News ASR
– Reference vs. Automatic Story Boundaries
[Figure: normalized detection cost for Dragon1, Dragon2, and UMass1 with reference vs. automatic boundaries — a 19%, 21%, and 41% relative increase in cost, respectively]
TDT Segmentation Task
[Figure: a transcription stream of text (words) segmented into story and non-story regions]
One Participant: MITRE
(For TDT 2000, story segmentation is an integral part of the other tasks, not just a separate evaluation task)
System Goal:
– To segment the source stream into its constituent stories, for all audio sources.
Story Segmentation Results
• Required Condition:
  – Broadcast News ASR
MITRE Segmentation Results
[Figure: normalized segmentation cost for the MITRE 1999 and 2000 systems — English with 10KW deferral, Mandarin with 15KC deferral]
TDT First Story Detection (FSD) Task
Two Participants: National Taiwan University and University of Massachusetts
System Goal:
– To detect the first story that discusses a topic, for all topics.
• Evaluating "part" of a Topic Detection system (i.e., when to start a new cluster)
[Figure: a stream of stories on two topics; the first story of each topic is marked as a first story, all later stories are not]
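Viewed as the cluster-creation decision, FSD reduces to a novelty test: flag a story when nothing seen before it is similar enough. A minimal sketch, with an illustrative similarity function and threshold (real systems use richer models and must also respect the decision deferral):

```python
def first_story_detection(stories, similarity, threshold=0.2):
    """Flag a story as FIRST when no earlier story is similar enough —
    exactly the moment a detection system would open a new cluster."""
    return [all(similarity(story, prev) < threshold for prev in stories[:i])
            for i, story in enumerate(stories)]

def jaccard(a, b):
    """Toy word-overlap similarity for the sketch."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)
```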
UMass Historical Performance Using Reference Story Boundaries
[Figure: UMass normalized FSD cost — 1999 system 0.76, 2000 system 0.64]
First Story Detection Results
• Required Condition:
  – English Newswire and Broadcast News ASR transcripts
  – Automatic story boundaries
  – One file decision deferral
[Figure: normalized FSD cost for NTU1 and UMass1 under three conditions — NWT+BNasr with reference boundaries, NWT+BNasr with automatic boundaries (the required condition), and NWT+BNman with reference boundaries]
1999 and 2000 Topic Set Differences in FSD Evaluation
For UMass there is a slight difference, but a marked difference for the NTU system
Summary
• Many, many things remain to look at
  – Results appear to be a function of topic size and topic set in the detection task, but it's unclear why.
    • The re-usability of last year's detection system outputs enables valuable studies
    • Conditioned detection evaluation should be replaced with a "contribution to cost" model
  – Performance variability on tracking training stories should be further investigated
  – …and the list goes on
• When should the annotations be released?
• Need to find cost-effective annotation techniques
  – Consider TREC ad-hoc style annotation via simulation