Zero Resource Spoken Term Detection on STD 06 dataset

Justin Chiu, Carnegie Mellon University

07/24/2012, JHU

Motivation

• Given an unknown language, can you do unsupervised spoken term detection?

• Using a high-level representation with some structural assumptions, we can make spoken term detection more robust
  – Query by example
  – Modeling
  – ASR approach

Proposed Approach

• Signals
• MFCC (13-dimensional vector)
  – One frame every 10 ms; each frame represents 25 ms of signal

• Each utterance = a sequence of MFCC frames
• Goal:
  – Cluster the MFCC frames
  – Represent each MFCC frame with its cluster label
  – Perform term detection with the SDTW algorithm
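As a rough illustration of the framing on this slide, the sketch below counts how many MFCC frames an utterance yields with a 25 ms window and a 10 ms shift. The function name `num_frames` and the choice to drop a trailing partial window are assumptions made for illustration; the slides do not say how the edges were handled.

```python
def num_frames(duration_ms, window_ms=25, shift_ms=10):
    """Count full analysis windows that fit in an utterance.

    Assumption: no padding, so a trailing partial window is dropped.
    """
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms


# One second of audio under these assumptions.
frames_in_one_second = num_frames(1000)
```

Under these assumptions one second of speech becomes roughly a hundred 13-dimensional vectors, which is the sequence that the clustering step operates on.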

Clustering

• K-means clustering
  – K-means clustering with 10 random starts
  – Store every cluster center as the model
• Gaussian mixture model
  – Clustering with Gaussian mixtures
  – Store the means and variances as the model
• Number of clusters decided on development data
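The K-means step with random restarts can be sketched as follows. This is a minimal pure-Python version of Lloyd's algorithm that keeps the restart with the lowest within-cluster squared error; the function names, the iteration cap, and the selection criterion are assumptions for illustration, since the slides only state that 10 random starts were used.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centers):
    """Index of the closest cluster center."""
    return min(range(len(centers)), key=lambda k: sq_dist(frame, centers[k]))


def kmeans_once(frames, k, rng, iters=50):
    """One run of Lloyd's algorithm from a random initialization."""
    centers = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        labels = [nearest(f, centers) for f in frames]
        new_centers = []
        for j in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(f, centers) for f in frames]
    inertia = sum(sq_dist(f, centers[lab]) for f, lab in zip(frames, labels))
    return centers, inertia


def kmeans_restarts(frames, k, restarts=10, seed=0):
    """Run several random starts and keep the lowest-error solution."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers, inertia = kmeans_once(frames, k, rng)
        if best is None or inertia < best[1]:
            best = (centers, inertia)
    return best


# Toy 2-D "frames" forming two well-separated groups.
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_centers, toy_inertia = kmeans_restarts(toy, k=2)
```

The stored `toy_centers` play the role of the model: a new frame is labeled with the index of its nearest center.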

Representation

• Hard representation (vector -> label)
  – Each audio file becomes a sequence of cluster labels
    • 14 14 22 22 22 25 25 26 …
  – Similar to text retrieval
• Soft representation (vector -> vector)
  – Represent every MFCC frame as the posterior probabilities over the Gaussian mixture components
  – A better vector for distance measurement
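The soft representation above can be sketched as the posterior of each mixture component given a frame. The example assumes diagonal covariances (consistent with storing a mean and a variance per component) and uses hypothetical names `posteriors` and `hard_label`; taking the argmax of the posterior is one way, assumed here, to recover a hard label from the same model.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of every mixture component given frame x."""
    log_joint = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(value - peak) for value in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def hard_label(post):
    """Index of the most probable component."""
    return max(range(len(post)), key=lambda k: post[k])


# Toy two-component mixture in two dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near_first = posteriors([0.0, 0.0], weights, means, variances)
midpoint = posteriors([2.0, 2.0], weights, means, variances)
```

A frame halfway between two components gets an even split of probability, which a hard label would hide.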

Segmental Dynamic Time Warping

• Distance measurement
  – Hard distance: match (0) / no match (1)
  – Soft distance: -log(a · q)

• Each jump: 500 ms
• x-y distance limitation: 500 ms

[Figure: SDTW alignment grid with audio frames a1 to a9 along one axis and query frames q1 to q4 along the other]
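The two frame distances on this slide, combined with a band-constrained alignment, can be sketched as below. This is a simplified stand-in rather than the exact SDTW used in the talk: it assumes candidate segments have the same length as the query, reads the "x-y distance limitation" as a band on |i - j| in frames, and reads the "jump" as the step between candidate start points. At a 10 ms frame shift, 500 ms would correspond to 50 frames. All function and parameter names are hypothetical.

```python
import math


def hard_distance(a, q):
    """0 if the cluster labels match, 1 otherwise."""
    return 0.0 if a == q else 1.0


def soft_distance(a, q, floor=1e-10):
    """-log of the dot product of two posterior vectors (floored to avoid log 0)."""
    dot = sum(x * y for x, y in zip(a, q))
    return -math.log(max(dot, floor))


def banded_dtw(segment, query, dist, band):
    """DTW cost with the warping path restricted to |i - j| <= band."""
    n, m = len(segment), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = dist(segment[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def sdtw_search(audio, query, dist, band, step):
    """Score candidate segments at regular start points; lower is better."""
    results = []
    for start in range(0, len(audio) - len(query) + 1, step):
        segment = audio[start:start + len(query)]
        score = banded_dtw(segment, query, dist, band) / len(query)
        results.append((score, start))
    return sorted(results)


# Hard-label toy example using the label sequence from the Representation slide.
audio_labels = [14, 14, 22, 22, 22, 25, 25, 26]
query_labels = [22, 22, 25]
ranked = sdtw_search(audio_labels, query_labels, hard_distance, band=1, step=1)
```

A detection decision would then compare each score against a threshold.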

NIST STD 06 Data set

• One of the datasets used to evaluate spoken term detection performance
• Advantage
  – Widely used because of the 2006 STD Evaluation Workshop; easy to compare with others
• Disadvantage
  – Only text queries are provided; there are no spoken queries

Choosing the dataset

• The 2006 STD dataset covers 3 different languages
  – Each language (English, Mandarin, Arabic) has different subsets
  – We select the English CTS (Conversational Telephone Speech) dataset
    • Reason: it has the most reported results
• Spoken query generation
  – Synthesized speech query: Flite
  – Extracted speech query: extracted from the dev set

Evaluation Measurement

• ATWV (Actual Term-Weighted Value)
• Term-Weighted Value (TWV) is one minus the average value lost by the system per term
  – TWV = 1 - Avg(Pmiss + w * PFA)
• Reference ATWV numbers (supervised):
  – English: 0.85
  – Mandarin: 0.38
  – Arabic: 0.34
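The TWV definition on this slide can be sketched as a short function. The per-term definitions used here (Pmiss as the fraction of true occurrences missed, PFA as false alarms divided by the number of non-target trials, approximated as seconds of speech minus true occurrences) and the weight w = 999.9 reflect my understanding of the NIST 2006 STD evaluation setup; treat them as assumptions rather than something stated in the slides. The function name `twv` and its input layout are hypothetical.

```python
def twv(per_term, speech_seconds, weight=999.9):
    """Term-weighted value: 1 - average(Pmiss + weight * PFA) over terms.

    per_term: list of (n_true, n_correct, n_false_alarm) tuples, one per term.
    Terms with no true occurrences are skipped (assumption).
    """
    losses = []
    for n_true, n_correct, n_false_alarm in per_term:
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + weight * p_fa)
    if not losses:
        raise ValueError("no terms with true occurrences")
    return 1.0 - sum(losses) / len(losses)


perfect = twv([(5, 5, 0)], speech_seconds=3600.0)   # every hit found, no false alarms
silent = twv([(5, 0, 0)], speech_seconds=3600.0)    # system outputs nothing
noisy = twv([(10, 10, 36)], speech_seconds=3610.0)  # every hit found, 36 false alarms
```

Because false alarms are weighted so heavily, a system that finds every true occurrence can still score far below zero, while a system that outputs nothing scores exactly zero.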

Query Comparison

• Primary experiments on the development set
• Synthesized queries
  – 1100 queries
    • ATWV: << 0
• Extracted queries
  – 411 extracted / combined queries
    • ATWV: -0.93
  – 135 longer queries (length > 1)
    • ATWV: 0.185

Evaluation Set Result

• Overrun by a tide of false alarms

Further struggle

• Remove the first dimension of the MFCC
  – It represents the power of the speech and takes large values
• Inverted frequency
  – If the same frame label appears too many times, it might be less important (background noise)
• Content-related bonus
  – Consecutive identical tags earn a bonus
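Two of the ideas above can be sketched in a few lines: dropping the first MFCC coefficient, and down-weighting cluster labels that occur very often. The log-ratio weight below is borrowed from inverse document frequency in text retrieval and is an assumption; the slides do not give the exact formula. The names `drop_energy` and `inverse_frequency_weights` are hypothetical.

```python
import math
from collections import Counter


def drop_energy(frame):
    """Remove the first MFCC coefficient, which mostly reflects signal power."""
    return list(frame[1:])


def inverse_frequency_weights(labels):
    """Weight each cluster label by log(total frames / frames with that label)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: math.log(total / count) for label, count in counts.items()}


# Label 7 dominates (for example, background noise), so it gets a smaller weight.
label_weights = inverse_frequency_weights([7, 7, 7, 9])
trimmed = drop_energy([50.0, 1.0, 2.0])
```

One possible use, again an assumption, is to scale the per-frame distance by the weight of the query label so that matches on very common labels contribute less to the alignment score.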

What we have learned

• A single MFCC frame is too short a unit for representing speech
• Mismatch in the speech signal matters a lot
  – Synthesized speech vs. extracted speech
• Many false alarms occur for short queries
  – "at" vs. "hat" vs. "bat"

Threshold

• How similar must two segments be before we decide they are the same word (detected or not)?
• How many abstract representation units should we use to represent an unknown language?
  – Possibly this can be handled with regularization

Representation

• We need to find a better representation (other than MFCC frames) for clustering
  – Phones work, so an appropriate representation should work too; it is expected to come from a data-driven approach
• Advanced approaches for representation
  – Lee, Glass
  – Jansen, Church
  – SSS + clustering

Spoken Term Detection Experiments

• Dataset
  – NIST Spoken Term Detection 2006 Evaluation set
  – Advantage:
    • The dataset is designed for the STD task
• Evaluation metrics
  – ATWV
  – Advantage:
    • Evaluation tool is available
    • Can compare with many supervised baselines

Summary

• Clustering on MFCC frames gives an inappropriate representation for speech
• Need a better representation of speech units
• Channel/speaker mismatch harms performance a lot
• The extracted spoken queries and audio for the English CTS data are available.

Personal Belief in Zero Resource STD

Speaker Dependent

Speaker Independent

Special Thanks

• Alex Rudnicky

• Florian Metze
• Alan Black
• Rita Singh
• Jack Mostow