Zero Resource Spoken Term Detection on STD 06 dataset Justin Chiu Carnegie Mellon University...

Zero Resource Spoken Term Detection on STD 06 dataset

Justin Chiu
Carnegie Mellon University

07/24/2012, JHU

Motivation

• Given an unknown language, can you do unsupervised spoken term detection?

• Using a high-level representation with some structural assumptions, we can make spoken term detection more robust
  – Query by example
  – Modeling
  – ASR approach

Proposed Approach

• Signals
• MFCC (13-dimensional vector)
  – One frame every 10 ms, each frame representing 25 ms of signal
• Each utterance = a sequence of MFCC frames
• Goal:
  – Cluster the MFCC frames
  – Represent each MFCC frame with its cluster label
  – Perform term detection using the SDTW algorithm
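The framing above fixes how many MFCC frames an utterance produces and how many frames a duration such as 500 ms corresponds to. A minimal sketch of that arithmetic, assuming no padding at the utterance edges (real front ends differ on this); the function names are hypothetical:

```python
def num_frames(duration_ms, window_ms=25, shift_ms=10):
    """Number of full analysis windows that fit in the utterance (no padding assumed)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms


def ms_to_frames(duration_ms, shift_ms=10):
    """Convert a duration to a frame count at the given frame shift."""
    return duration_ms // shift_ms


print(num_frames(1000))   # frames in one second of audio
print(ms_to_frames(500))  # frames spanned by a 500 ms limit
```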

Clustering

• K-means clustering
  – K-means clustering with 10 random starts
  – Store every cluster center as the model
• Gaussian mixture model
  – Clustering with Gaussian mixtures
  – Store the means and variances as the model
• The number of clusters is decided on development data
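The K-means step with random restarts can be sketched as follows. This is an illustrative pure-Python version under assumed settings (squared Euclidean distance, a fixed iteration count, lowest total distortion picks the winning restart), not the implementation used for the experiments; all function names are hypothetical, and the toy frames are 2-dimensional rather than 13-dimensional.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centers):
    """Index of the closest cluster center."""
    return min(range(len(centers)), key=lambda j: sq_dist(frame, centers[j]))


def kmeans_once(frames, k, rng, iters=20):
    """One K-means run from a random start; returns (centers, total distortion)."""
    centers = rng.sample(frames, k)
    for _ in range(iters):
        labels = [nearest(f, centers) for f in frames]
        for j in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(sq_dist(f, centers[nearest(f, centers)]) for f in frames)
    return centers, cost


def kmeans_restarts(frames, k, n_starts=10, seed=0):
    """Run K-means n_starts times and keep the run with the lowest distortion."""
    rng = random.Random(seed)
    runs = [kmeans_once(frames, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1])


toy_frames = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, cost = kmeans_restarts(toy_frames, k=2)
print(sorted(centers), cost)
```

The stored `centers` list plays the role of the model; the number of clusters `k` would be tuned on development data.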

Representation

• Hard representation (Vector -> Label)
  – Each audio file becomes a sequence of cluster labels
    • 14 14 22 22 22 25 25 26 …
  – Similar to text retrieval
• Soft representation (Vector -> Vector)
  – Represent every MFCC frame as the posterior probability for every Gaussian mixture
  – Better vector for distance measurement
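A small sketch of the two representations. The hard version maps a frame to the index of its nearest cluster center; the soft version maps it to a posterior vector over Gaussian components. Diagonal covariances and explicit mixture weights are assumptions made here for illustration, and the function names are hypothetical.

```python
import math


def hard_label(frame, centers):
    """Vector -> label: index of the nearest cluster center."""
    return min(
        range(len(centers)),
        key=lambda j: sum((x - c) ** 2 for x, c in zip(frame, centers[j])),
    )


def log_gauss(frame, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(frame, mean, var)
    )


def soft_posterior(frame, means, variances, weights):
    """Vector -> vector: posterior probability of each Gaussian component."""
    logs = [
        math.log(w) + log_gauss(frame, m, v)
        for m, v, w in zip(means, variances, weights)
    ]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.5]
print(hard_label([0.2, 0.1], means))
print(soft_posterior([2.0, 2.0], means, variances, weights))
```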

Segmental Dynamic Time Warping

• Distance measurement
  – Hard distance: match (0) / no match (1)
  – Soft distance: -log(a · q), where a and q are the posterior vectors being compared
• Each jump: 500 ms
• x-y distance limitation: 500 ms

[Figure: SDTW alignment grid; axis labels a1 … a9]
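The two frame distances and the band limit can be sketched with a band-constrained DTW that is restarted at regular offsets into the audio. This is a simplified illustration, not the exact SDTW used in the experiments: it assumes that a is an audio frame and q a query frame, that "each jump" is the step between start offsets, that the x-y limit is a band |i - j| <= band, and that 500 ms equals 50 frames at a 10 ms shift. Normalizing the path cost by the query length is a further simplification, and all names are hypothetical.

```python
import math


def hard_dist(a, q):
    """Hard distance on cluster labels: 0 for a match, 1 otherwise."""
    return 0.0 if a == q else 1.0


def soft_dist(a, q, floor=1e-10):
    """Soft distance on posterior vectors: -log(a . q), floored to avoid log(0)."""
    return -math.log(max(sum(x * y for x, y in zip(a, q)), floor))


def banded_dtw(query, audio, start, dist, band=50):
    """Best band-limited alignment cost of query against audio from `start`."""
    seg = audio[start:start + len(query) + band]
    n, m = len(query), len(seg)
    inf = float("inf")
    if m == 0:
        return inf
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # only visit cells inside the band |i - j| <= band
        for j in range(max(1, i - band), min(m, i + band) + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            if best < inf:
                cost[i][j] = dist(seg[j - 1], query[i - 1]) + best
    return min(cost[n][1:]) / n  # the path may end at any audio frame


def detect(query, audio, dist, step=50, band=50):
    """Score one alignment per start offset; lower scores are better matches."""
    return [(s, banded_dtw(query, audio, s, dist, band))
            for s in range(0, len(audio), step)]


query = [14, 14, 22, 25]
audio = [3, 3, 14, 14, 22, 22, 25, 7]
print(detect(query, audio, hard_dist, step=1, band=2))
```

A detection would then be declared wherever the score falls below a threshold chosen on development data.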

NIST STD 06 Data set

• One of the datasets used to evaluate spoken term detection performance
• Advantage
  – Widely used because of the 2006 STD Evaluation Workshop, so it is easy to compare with others
• Disadvantage
  – Only text queries are provided; there are no spoken queries

Choosing the dataset

• The 2006 STD dataset covers 3 different languages
  – Each language (English, Mandarin, Arabic) has different subsets
  – We select the English CTS (Conversational Telephone Speech) dataset
    • Reason: it has the most reported results
• Spoken query generation
  – Synthesized speech query: Flite
  – Extracted speech query: extracted from the dev set

Evaluation Measurement

• ATWV (Average Term Weighted Value)
• Term-Weighted Value (TWV) is one minus the average value lost by the system per term:
  TWV = 1 - Avg(Pmiss + w * PFA)
• Reference ATWV numbers (supervised):
  – English: 0.85
  – Mandarin: 0.38
  – Arabic: 0.34
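A worked sketch of the TWV formula above, showing why a system with many false alarms can score far below zero. It assumes the conventions of the NIST STD 2006 evaluation: Pmiss = 1 - Ncorrect / Ntrue, PFA = Nspurious / (seconds of speech - Ntrue), and w = 999.9. The function name and the input tuple layout are hypothetical.

```python
def twv(term_stats, speech_seconds, w=999.9):
    """term_stats: list of (n_true, n_correct, n_false_alarm), one tuple per term."""
    losses = []
    for n_true, n_correct, n_false_alarm in term_stats:
        if n_true == 0:
            continue  # terms with no reference occurrence are left out of the average
        p_miss = 1 - n_correct / n_true
        p_fa = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + w * p_fa)
    return 1 - sum(losses) / len(losses)


print(twv([(5, 5, 0), (2, 2, 0)], 10000))  # every occurrence found, no false alarms
print(twv([(5, 0, 0)], 10000))             # system that outputs nothing
print(twv([(5, 5, 100)], 10000))           # all hits found, but 100 false alarms
```

Because w is so large, even a small false alarm rate outweighs a perfect hit rate, which is how an ATWV well below zero arises.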

Query Comparison

• Primary experiments on the development set
• Synthesized queries
  – 1100 queries
    • ATWV: <<0
• Extracted queries
  – 411 extracted / combined queries
    • ATWV: -0.93
  – 135 longer queries (length > 1)
    • ATWV: 0.185

Evaluation Set Result

• Overrun by a tide of false alarms

Further struggle

• Remove the first dimension of the MFCC
  – It represents the power of the speech and takes large values
• Inverted frequency
  – A frame label that appears too often may be less important (background noise)
• Content-related bonus
  – Consecutive identical tags provide a bonus
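Two of these ideas are easy to illustrate. The sketch below drops the first MFCC coefficient from every frame and computes an inverse-frequency weight per cluster label. The specific weighting, log(total / count), is an assumption borrowed from text retrieval, not necessarily the formula tried in the experiments; the function names are hypothetical.

```python
import math
from collections import Counter


def drop_first_dimension(frames):
    """Remove the first coefficient (the power term) from every MFCC frame."""
    return [frame[1:] for frame in frames]


def inverse_frequency_weights(labels):
    """Weight each cluster label by log(total / count): frequent labels weigh less."""
    counts = Counter(labels)
    total = len(labels)
    return {label: math.log(total / count) for label, count in counts.items()}


labels = [7, 7, 7, 7, 7, 7, 14, 22]  # label 7 dominates, e.g. background noise
weights = inverse_frequency_weights(labels)
print(weights)
print(drop_first_dimension([[60.0, 1.0, 2.0]]))
```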

What we have learned

• A single MFCC frame is too short a unit for representing speech
• Mismatch in the speech signal has a large effect
  – Synthesized speech vs extracted speech
• Many false alarms occur for short queries
  – At vs hat vs bat

Threshold

• How similar must two segments be before we decide they are the same word? (Detected or not)
• How many abstract representation units should we use to represent an unknown language?
  – Possibly this can be handled with regularization

Representation

• We need to find a better representation (other than MFCC frames) to do the clustering
  – Phones work, so an appropriate representation should work as well; it is expected to come from a data-driven approach
• Advanced approaches for representation
  – Lee, Glass
  – Jansen, Church
  – SSS + clustering

Spoken Term Detection Experiments

• Dataset
  – NIST Spoken Term Detection 2006 Evaluation set
  – Advantage:
    • The dataset is designed for the STD task
• Evaluation metric
  – ATWV
  – Advantage:
    • Evaluation tool is available
    • Can compare with many supervised baselines

Summary

• Clustering on MFCC frames gives an inappropriate representation for speech
• Need a better representation of speech units
• Channel/speaker mismatch harms performance a lot
• The extracted spoken queries and audio for the English CTS data are available

Personal Belief in Zero Resource STD

Speaker Dependent

Speaker Independent

Special Thanks

• Alex Rudnicky
• Florian Metze
• Alan Black
• Rita Singh
• Jack Mostow
