Welcome to the Rich Transcription 2005 Spring
Meeting Recognition Evaluation Workshop
July 13, 2005
Royal College of Physicians Edinburgh, UK
Today’s Agenda
Administrative Points
• Participants:
  – Pick up the hard copy proceedings at the front desk
• Presenters:
  – The agenda will be strictly followed
    • Time slots include Q&A time
  – Presenters should either:
    • Load their presentations on the computer at the front, or
    • Test their laptops during the breaks prior to making their presentation
• We’d like to thank:
  – The MLMI-05 organizing committee for hosting this workshop
  – Caroline Hastings for the workshop’s administration
  – All the volunteers: evaluation participants, data providers, transcribers, annotators, paper authors, presenters, and other contributors
The Rich Transcription 2005 Spring Meeting
Recognition Evaluation
Jonathan Fiscus, Nicolas Radde, John Garofolo, Audrey Le, Jerome Ajot, Christophe Laprun
July 13, 2005
Rich Transcription 2005 Spring Meeting Recognition Workshop
at MLMI 2005
http://www.nist.gov/speech/tests/rt/rt2005/spring/
Overview
• Rich Transcription Evaluation Series
• Research opportunities in the Meeting Domain
• RT-05S Evaluation
  – Audio input conditions
  – Corpora
  – Evaluation tasks and results
• Conclusion/Future
The Rich Transcription Task
[Diagram: component recognition technologies feed Rich Transcription (Speech-To-Text + Metadata) of human-to-human speech, which in turn supports multiple applications: smart meeting rooms, translation, extraction, retrieval, summarization, and readable transcripts]
Rich Transcription Evaluation Series
• Goal:
  – Develop recognition technologies that produce transcripts which are understandable by humans and useful for downstream processes
• Domains:
  – Broadcast News (BN)
  – Conversational Telephone Speech (CTS)
  – Meeting Room speech
• Parameterized “Black Box” evaluations
  – Evaluations control input conditions to investigate weaknesses/strengths
  – Sub-test scoring provides finer-grained diagnostics
Research Opportunities in the Meeting Domain
• Provide fertile environment to advance state-of-the-art in technologies for understanding human interaction
• Many potential applications
  – Meeting archives, interactive meeting rooms, remote collaborative systems
• Important Human Language Technology challenges not posed by other domains
  – Varied forums and vocabularies
  – Highly interactive and overlapping spontaneous speech
  – Far-field speech effects
    • Ambient noise
    • Reverberation
    • Participant movement
  – Varied room configurations
    • Many microphone conditions
    • Many camera views
  – Multimedia information integration
    • Person, face, and head detection/tracking
RT-05S Evaluation Tasks
• Focus on core speech technologies
  – Speech-to-Text Transcription (STT)
  – Diarization “Who Spoke When” (SPKR)
  – Diarization “Speech Activity Detection” (SAD)
  – Diarization “Source Localization” (SLOC)
Five System Input Conditions
• Distant microphone conditions
  – Multiple Distant Microphones (MDM)
    • Three or more centrally located table mics
  – Multiple Source Localization Arrays (MSLA)
    • Inverted “T” topology, 4-channel digital microphone array
  – Multiple Mark III digital microphone Arrays (MM3A)
    • Linear topology, 64-channel digital microphone array
• Contrastive microphone conditions
  – Single Distant Microphone (SDM)
    • Center-most MDM microphone
    • Gauges the performance benefit of using multiple table mics
  – Individual Head Microphones (IHM)
    • Performance on clean speech
    • Similar to Conversational Telephone Speech
      – One speaker per channel, conversational speech
Training/Development Corpora
• Corpora provided at no cost to participants
  – ICSI Meeting Corpus
  – ISL Meeting Corpus
  – NIST Meeting Pilot Corpus
  – Rich Transcription 2004 Spring (RT-04S) Development & Evaluation Data
  – Topic Detection and Tracking Phase 4 (TDT4) corpus
  – Fisher English conversational telephone speech corpus
  – CHIL development test set
  – AMI development test set and training set
• Thanks to ELDA and LDC for making this possible
RT-05S Evaluation Test Corpora: Conference Room Test Set
• Goal-oriented small conference room meetings
  – Group meetings and decision-making exercises
  – Meetings involved 4-10 participants
• 120 minutes
  – Ten excerpts, each twelve minutes in duration
  – Five sites donated two meetings each:
    • Augmented Multiparty Interaction (AMI) Program, International Computer Science Institute (ICSI), NIST, and Virginia Tech (VT)
  – No VT data was available for system development
  – Similar test set construction used for the RT-04S evaluation
• Microphones:
  – Participants wore head microphones
  – Microphones were placed on the table among participants
  – AMI meetings included an 8-channel circular microphone array on the table
  – NIST meetings included 3 Mark III digital microphone arrays
RT-05S Evaluation Test Corpora: Lecture Room Test Set
• Technical lectures in small meeting rooms
  – Educational events where a single lecturer is briefing an audience on a particular topic
  – Meeting excerpts involve one lecturer and up to five participating audience members
• 150 minutes
  – 29 excerpts from 16 lectures
  – Two types of excerpts selected by CMU
    • Lecturer excerpts – 89 minutes, 17 excerpts
    • Question & Answer (Q&A) excerpts – 61 minutes, 12 excerpts
• All data collected at Karlsruhe University
• Sensors:
  – The lecturer and at most two other participants wore head microphones
  – Microphones were placed on the table among participants
  – A source localization array was mounted on each of the room’s four walls
  – A Mark III array was mounted on the wall opposite the lecturer
RT-05S Evaluation Participants

Site ID    Site Name                                                        STT   SPKR   SAD   SLOC
AMI        Augmented Multiparty Interaction Program                          X
ICSI/SRI   International Computer Science Institute and SRI International    X     X
ITC-irst   Center for Scientific and Technological Research                                      X
KU         Karlsruhe University                                                                  X
ELISA      Laboratoire Informatique d'Avignon (LIA), Communication
           Langagière et Interaction Personne-Système (CLIPS), and LIUM            X      X
MQU        Macquarie University                                                    X
Purdue     Purdue University                                                              X
TNO        The Netherlands Organisation for Applied Scientific Research            X      X
TUT        Tampere University of Technology                                                      X
Diarization “Who Spoke When” (SPKR) Task
• Task definition
  – Identify the number of participants in each meeting and create a list of speech time intervals for each such participant
• Several input conditions:
  – Primary: MDM
  – Contrast: SDM, MSLA
• Four participating sites: ICSI/SRI, ELISA, MQU, TNO
SPKR System Evaluation Method
• Primary Metric
  – Diarization Error Rate (DER) – the ratio of incorrectly detected speaker time to total speaker time
• System output speaker segment sets are mapped to reference speaker segment sets so as to minimize the total error
• Errors consist of:
  – Speaker assignment errors (i.e., detected speech not assigned to the right speaker)
  – False alarm detections
  – Missed detections
• Systems were scored using the mdeval tool
  – Forgiveness collar of +/- 250 ms around reference segment boundaries
• DER on non-overlapping speech is the primary metric
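To illustrate how DER behaves, here is a minimal frame-based sketch. This is not the mdeval tool: it uses discrete frames instead of timed segments, omits the +/- 250 ms collar, assumes equally many reference and system speakers, and brute-forces the error-minimizing speaker mapping. All names are hypothetical.

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Toy frame-based Diarization Error Rate (hypothetical helper).

    ref, hyp: dicts mapping a speaker label to the set of frame indices
    in which that speaker is talking.
    """
    total_ref = sum(len(frames) for frames in ref.values())  # scored speaker time
    all_frames = set().union(*ref.values(), *hyp.values())
    best = None
    for perm in permutations(ref):  # try every one-to-one hyp->ref label mapping
        mapping = dict(zip(hyp, perm))
        err = 0
        for t in all_frames:
            r = {spk for spk, frames in ref.items() if t in frames}
            h = {mapping[spk] for spk, frames in hyp.items() if t in frames}
            # per-frame error covers misses, false alarms, and wrong-speaker time
            err += max(len(r), len(h)) - len(r & h)
        best = err if best is None else min(best, err)
    return best / total_ref

# Two reference speakers; the system finds both but cuts speaker B short.
ref = {"A": set(range(0, 10)), "B": set(range(10, 20))}
hyp = {"s1": set(range(0, 10)), "s2": set(range(10, 15))}
print(frame_der(ref, hyp))  # 5 missed frames / 20 speaker frames = 0.25
```

Note that the best mapping (s1→A, s2→B) leaves only missed detections; the alternative mapping would turn all detected speech into speaker assignment errors, which is why the metric searches over mappings.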
RT-05S SPKR Results: Primary Systems, Non-Overlapping Speech
• Conference room SDM DER was lower than MDM DER
  – A sign test indicates the differences are not significant
• The primary ICSI/SRI lecture room system attributed the entire duration of each test excerpt to a single speaker
  – The ICSI/SRI contrastive system had a lower DER
[Bar chart: DER (0-60) for primary SPKR systems (ICSI/SRI, ELISA, MQU, TNO) under conference room MDM/SDM and lecture room MDM/MSLA/SDM conditions]
Lecture Room Results: Broken Down by Excerpt Type
• Lecturer excerpt DERs are lower than Q&A excerpt DERs
[Bar chart: DER (%) over lecturer excerpts, Q&A excerpts, and all data for the ICSI/SRI contrastive, ELISA, and ICSI/SRI primary systems]
Historical Best System SPKR Performance on Conference Data
• 20% relative DER reduction for MDM
• 43% relative DER reduction for SDM

DER      MDM    SDM
RT-04S   23.3   27.5
RT-05S   18.6   15.3
Diarization “Speech Activity Detection” (SAD) Task
• Task definition
  – Create a list of speech time intervals where at least one person is talking
• Dry run evaluation for RT-05S
  – Proposed by CHIL
• Several input conditions:
  – Primary: MDM
  – Contrast: SDM, MSLA, IHM
• Systems designed for the IHM condition must detect speech and also reject cross-talk speech and breath noises; therefore IHM systems are not directly comparable to MDM or SDM systems
• Three participating sites: ELISA, Purdue, TNO
SAD System Evaluation Method
• Primary metric
  – Diarization Error Rate (DER)
    • Same formula and software as used for the SPKR task
    • Reduced to a two-class problem: speech vs. non-speech
    • No speaker assignment errors, just false alarms and missed detections
  – Forgiveness collar of +/- 250 ms around reference segment boundaries
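Because SAD collapses DER to two classes, the metric reduces to missed speech plus false-alarm speech over total reference speech. A minimal frame-based sketch (hypothetical names, and again without the collar):

```python
def sad_der(ref_speech, hyp_speech):
    """Toy two-class SAD Diarization Error Rate (hypothetical helper).

    ref_speech, hyp_speech: sets of frame indices where at least one
    person is talking. With only two classes there are no speaker
    assignment errors, just misses and false alarms.
    """
    missed = len(ref_speech - hyp_speech)       # speech the system dropped
    false_alarm = len(hyp_speech - ref_speech)  # non-speech it called speech
    return (missed + false_alarm) / len(ref_speech)

ref = set(range(0, 100))                         # 100 frames of speech
hyp = set(range(5, 100)) | set(range(100, 105))  # 5 missed, 5 false alarms
print(sad_der(ref, hyp))  # (5 + 5) / 100 = 0.1
```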
RT-05S SAD Results: Primary Systems
• DERs for conference and lecture room MDM data are similar
• Purdue didn’t compensate for breath noise and crosstalk
[Bar chart: DER for primary SAD systems (ELISA, Purdue, TNO) under conference room MDM/IHM and lecture room MDM conditions; reported values include 7.42%, 6.59%, 26.97%, and 5.04%]
Speech-To-Text (STT) Task
• Task definition
  – Systems output a single stream of time-tagged word tokens
• Several input conditions:
  – Primary: MDM
  – Contrast: SDM, MSLA, IHM
• Two participating sites: AMI and ICSI/SRI
STT System Evaluation Method
• Primary metric
  – Word Error Rate (WER) – the ratio of inserted, deleted, and substituted words to the total number of words in the reference
    • System and reference words are normalized to a common form
    • System words are mapped to reference words using a word-mediated dynamic programming string alignment program
• Systems were scored using the NIST Scoring Toolkit (SCTK) version 2.1
  – A Spring 2005 update to the SCTK alignment tool can now score most of the overlapping speech in the distant microphone test material
    • Can now handle up to 5 simultaneous speakers
      – 98% of the Conference Room test set can be scored
      – 100% of the Lecture Room test set can be scored
  – Greatly improved over the Spring 2004 prototype
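The dynamic-programming alignment behind WER can be sketched as follows. This is an illustrative re-implementation, not SCTK; it skips word normalization and overlapping-speech handling.

```python
def wer(ref_text, hyp_text):
    """Word Error Rate via Levenshtein alignment: (S + D + I) / N."""
    r, h = ref_text.split(), hyp_text.split()
    # d[i][j]: minimum edits turning the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # match or substitute
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # vs. delete/insert
    return d[len(r)][len(h)] / len(r)

# One substitution ("sat" -> "sits") and one deletion ("down") in 4 reference words.
print(wer("the cat sat down", "the cat sits"))  # 2 / 4 = 0.5
```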
RT-05S STT Results: Primary Systems (Incl. Overlaps)
• First evaluation for the AMI team
• IHM error rates for conference and lecture room data are comparable
• ICSI/SRI lecture room MSLA WER is lower than MDM/SDM WER
[Bar chart: WER (0-60%) for AMI and ICSI/SRI under conference room MDM/SDM/IHM and lecture room MDM/SDM/IHM/MSLA microphone conditions]
Historical STT Performance in the Meeting Domain
• Performance for ICSI/SRI has dramatically improved for all conditions
[Bar chart: WER (0-60) for RT-04S vs. RT-05S under MDM (1 speaker, <= 5 speakers), SDM (1 speaker, <= 5 speakers), and IHM (all data) conditions]
Diarization “Source Localization” (SLOC) Task
• Task definition
  – Systems track the three-dimensional position of the lecturer (using audio input only)
  – Constrained to the lecturer subset of the Lecture Room test set
  – Evaluation protocol and metrics defined in the CHIL “Speaker Localization and Tracking – Evaluation Criteria” document
• Dry run pilot evaluation for RT-05S
  – Proposed by CHIL
  – CHIL provided the scoring software and annotated the evaluation data
• One evaluation condition
  – Multiple source localization arrays
    • Required calibration of source localization microphone positions and video cameras
• Three participating sites: ITC-irst, KU, TUT
SLOC System Evaluation Method
• Primary Metric:
  – Root Mean Squared Error (RMSE) – a measure of the average Euclidean distance between the reference speaker position and the system-determined speaker position
    • Measured in millimeters at 667 ms intervals
• IRST SLOC scoring software
► Maurizio Omologo will give further details this afternoon
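As a sketch of the metric (not the IRST scoring software), RMSE over paired 3-D positions sampled at the scoring instants could look like this; the helper name and data are hypothetical:

```python
import math

def sloc_rmse(ref_positions, hyp_positions):
    """Toy RMSE between reference and hypothesized 3-D speaker positions.

    Each list holds one (x, y, z) tuple in millimeters per scoring
    instant (e.g. every 667 ms). RMSE is the square root of the mean
    squared Euclidean distance between paired positions.
    """
    sq_dists = [
        sum((a - b) ** 2 for a, b in zip(ref, hyp))
        for ref, hyp in zip(ref_positions, hyp_positions)
    ]
    return math.sqrt(sum(sq_dists) / len(sq_dists))

# One instant off by a 3-4-5 triangle (500 mm), one spot-on.
ref = [(0, 0, 0), (1000, 2000, 1500)]
hyp = [(300, 400, 0), (1000, 2000, 1500)]
print(sloc_rmse(ref, hyp))  # sqrt((500**2 + 0) / 2) ~= 353.6 mm
```

Squaring before averaging means a few large excursions dominate the score, which is one reason the results slide separates "fine" and "gross" errors.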
RT-05S SLOC Results: Primary Systems
• Issues:
  – What accuracy and resolution are needed for successful beamforming?
  – What will performance be for multiple speakers?
[Bar chart: RMSE (fine + gross), in mm (0-900), for ITC-irst, KU, and TUT]
Summary
• Nine sites participated in the RT-05S evaluation
  – Up from six in RT-04S
• Four evaluation tasks were supported across two meeting sub-domains:
  – Two experimental tasks, SAD and SLOC, were successfully completed
  – Dramatically lower STT and SPKR error rates for RT-05S
Issues for RT-06 Meeting Eval
• Domain
  – Sub-domains
• Tasks
  – Require at least three sites per task
  – Agreed-upon primary condition for each task
• Data contributions
  – Source data and annotations
• Participation intent
• Participation commitment
• Decision making process
  – Only sites with intent to participate will have input to the task definition