Zero Resource Spoken Term Detection on STD 06 dataset
Justin Chiu, Carnegie Mellon University
07/24/2012, JHU
Motivation
• Given an unknown language, can you do unsupervised spoken term detection?
• Using a high-level representation, with some structural assumptions, we can make spoken term detection more robust
  – Query by example
  – Modeling
  – ASR approach
Proposed Approach
• Signals
• MFCC (13-dimensional vector)
  – One frame every 10 ms; each frame represents 25 ms of signal
• Each utterance = a sequence of MFCC frames
• Goal:
  – Cluster the MFCC frames
  – Represent each MFCC frame with a cluster label
  – Perform term detection using the SDTW algorithm
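As a rough check on the framing above, the number of MFCC frames in an utterance can be estimated from its duration. This is a minimal sketch that assumes a 25 ms window advanced by 10 ms with no padding at the edges (the slides do not say how edges are handled); the function name and defaults are illustrative.

```python
def num_frames(duration_ms: float, window_ms: float = 25.0, hop_ms: float = 10.0) -> int:
    """Count the full analysis windows that fit in a signal of the given duration.

    Assumes no padding: a window is only counted if it fits entirely
    inside the signal.
    """
    if duration_ms < window_ms:
        return 0
    return int((duration_ms - window_ms) // hop_ms) + 1


if __name__ == "__main__":
    # One second of telephone speech becomes a sequence of 13-dimensional vectors.
    print(num_frames(1000.0))
```

Under these assumptions, one second of audio yields just under 100 frames, so even a short query is a sequence of dozens of vectors.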
Clustering
• K-means clustering
  – K-means clustering with 10 random starts
  – Store every cluster center as the model
• Gaussian mixture model
  – Clustering with Gaussian mixtures
  – Store the means and variances as the model
• Number of clusters decided on development data
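The K-means step with random restarts can be sketched as below. This is an illustrative pure-Python version, not the code used for the experiments: the function names, the iteration cap, and the choice to keep the restart with the lowest total squared distance are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centers):
    """Index of the center closest to the frame."""
    return min(range(len(centers)), key=lambda j: sq_dist(frame, centers[j]))


def kmeans_once(frames, k, rng, iters=50):
    """One K-means run from a random initialisation; returns (centers, cost)."""
    centers = [list(c) for c in rng.sample(frames, k)]
    for _ in range(iters):
        labels = [nearest(f, centers) for f in frames]
        new_centers = []
        for j in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster where it was
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(sq_dist(f, centers[nearest(f, centers)]) for f in frames)
    return centers, cost


def kmeans_restarts(frames, k, restarts=10, seed=0):
    """Run K-means several times and keep the lowest-cost solution (assumed criterion)."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers, cost = kmeans_once(frames, k, rng)
        if best is None or cost < best[1]:
            best = (centers, cost)
    return best
```

The stored model is just the list of centers; the number of clusters `k` would be tuned on development data as the slide states.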
Representation
• Hard representation (Vector -> Label)
  – Each audio file becomes a sequence of cluster labels
    • 14 14 22 22 22 25 25 26 …
  – Similar to text retrieval
• Soft representation (Vector -> Vector)
  – Represent every MFCC frame as its posterior probability for every Gaussian mixture
  – Better vector for distance measurement
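The two representations can be illustrated as follows. This is a sketch under stated assumptions: the hard label is taken as the nearest cluster center, and the soft vector is the posterior over Gaussians with diagonal covariance and explicit mixture weights. The slides only mention stored means and variances, so the weights and the diagonal form are assumptions, as are all function names.

```python
import math


def hard_label(frame, centers):
    """Vector -> label: index of the nearest cluster center."""
    return min(
        range(len(centers)),
        key=lambda j: sum((x - c) ** 2 for x, c in zip(frame, centers[j])),
    )


def log_gauss_diag(frame, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed model form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(frame, mean, var)
    )


def soft_posterior(frame, weights, means, variances):
    """Vector -> vector: posterior probability of each mixture given the frame."""
    logs = [
        math.log(w) + log_gauss_diag(frame, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]
```

A hard label discards how close the runner-up cluster was; the posterior vector keeps that information, which is what makes it the better input for a distance measure.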
Segmental Dynamic Time Warping
• Distance measurement
  – Hard distance: match (0) / no match (1)
  – Soft distance: -log(a · q)
• Each jump: 500 ms; x-y distance limitation: 500 ms
[Figure: alignment grid with frames a1 to a9 along one axis and q1 to q4 along the other]
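The two frame distances and a band-limited alignment can be sketched as below. This is a simplified illustration, not the exact SDTW variant used in the experiments. It reads the 500 ms figures as a hop between candidate start points and a limit on how far the path may drift from the diagonal (about 50 frames each at a 10 ms frame rate), which is an interpretation of the slide. The length normalisation, the free end point, and all function names are assumptions.

```python
import math

INF = float("inf")


def hard_distance(a, q):
    """0 when two cluster labels match, 1 otherwise."""
    return 0.0 if a == q else 1.0


def soft_distance(a, q, floor=1e-10):
    """-log of the dot product of two posterior vectors (floored to avoid log(0))."""
    dot = sum(x * y for x, y in zip(a, q))
    return -math.log(max(dot, floor))


def banded_dtw(query, segment, dist, band):
    """DTW cost of query vs. segment, keeping the path within `band` frames of the diagonal."""
    n, m = len(query), len(segment)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            if prev < INF:
                cost[i][j] = dist(query[i - 1], segment[j - 1]) + prev
    # The path may end on any reachable segment frame; normalise by path extent.
    ends = [cost[n][j] / (n + j) for j in range(1, m + 1) if cost[n][j] < INF]
    return min(ends) if ends else INF


def segmental_dtw(query, utterance, dist, band, hop):
    """Score candidate segments starting every `hop` frames; best (lowest) first."""
    hits = []
    for start in range(0, len(utterance), hop):
        segment = utterance[start:start + len(query) + band]
        hits.append((banded_dtw(query, segment, dist, band), start))
    return sorted(hits)
```

For example, `segmental_dtw([14, 22, 25], labels, hard_distance, band=2, hop=1)` ranks every start position in a label sequence by how cheaply the query aligns there; a detection decision then needs a threshold on that score.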
NIST STD 06 Data set
• One of the datasets used to evaluate spoken term detection performance
• Advantage
  – Widely used because of the 2006 STD Evaluation Workshop; easy to compare with others
• Disadvantage
  – Only text queries are provided; there are no spoken queries
Choosing the dataset
• The 2006 STD dataset covers 3 different languages
  – Each language (English, Mandarin, Arabic) has different subsets
  – We select the English CTS (Conversational Telephone Speech) dataset
    • Reason: it has the most reported results
• Spoken query generation
  – Synthesized speech query: Flite
  – Extracted speech query: extracted from the dev set
Evaluation Measurement
• ATWV (Actual Term-Weighted Value)
• Term-Weighted Value (TWV) is one minus the average value lost by the system per term
  – TWV = 1 - Avg(Pmiss + w * PFA)
• Reference ATWV numbers (supervised):
  – English: 0.85
  – Mandarin: 0.38
  – Arabic: 0.34
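A small worked example shows how the metric behaves. This sketch follows the formula on the slide, with Pmiss and PFA computed per term and then averaged. The per-term definitions (misses over true occurrences, false alarms over seconds of speech minus true occurrences) and the default weight of 999.9 follow my reading of the NIST 2006 evaluation plan and should be treated as assumptions here; the function name is illustrative.

```python
def term_weighted_value(terms, speech_seconds, weight=999.9):
    """TWV = 1 - average over terms of (Pmiss + weight * PFA).

    `terms` is a list of (n_true, n_correct, n_false_alarm) counts per term.
    Terms with no true occurrence are skipped (assumed convention).
    """
    losses = []
    for n_true, n_correct, n_false_alarm in terms:
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: one per second of speech, minus the true occurrences.
        p_fa = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + weight * p_fa)
    if not losses:
        raise ValueError("no term with a true occurrence")
    return 1.0 - sum(losses) / len(losses)


if __name__ == "__main__":
    hour = 3600.0
    print(term_weighted_value([(2, 2, 0)], hour))   # every occurrence found, no false alarms
    print(term_weighted_value([(2, 0, 0)], hour))   # system outputs nothing
    print(term_weighted_value([(2, 2, 10)], hour))  # all found, but 10 false alarms
```

Because false alarms are weighted so heavily, a system that finds every occurrence but also fires on many wrong segments scores below zero, which is worse than returning nothing at all.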
Query Comparison
• Primary experiments on the development set
• Synthesized query
  – 1100
    • ATWV: << 0
• Extracted query
  – 411 extracted / combined queries
    • ATWV: -0.93
  – 135 longer queries (length > 1)
    • ATWV: 0.185
Evaluation Set Result
• Overrun by a tide of false alarms
Further struggle
• Remove the first dimension of the MFCC
  – It represents the power of the speech and takes large values
• Inverted frequency
  – A frame that appears too many times might be less important (background noise)
• Content-related bonus
  – Consecutive identical tags earn a bonus
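The first two adjustments can be sketched as below. The slides do not give the weighting formula, so the log-ratio weight used here (in the spirit of inverse document frequency from text retrieval) is an assumption, as are the function names.

```python
import math
from collections import Counter


def drop_first_coefficient(frames):
    """Remove the first MFCC dimension (the power term) from every frame."""
    return [frame[1:] for frame in frames]


def inverse_label_frequency(label_sequences):
    """Weight each cluster label by log(total frames / frames with that label).

    Labels that occur very often, such as background noise, get a small weight.
    The exact formula is an assumption; the slides only state the idea.
    """
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    return {label: math.log(total / count) for label, count in counts.items()}
```

Such weights could scale the hard match/no-match distance so that agreeing on a rare label counts for more than agreeing on a very common one.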
What we have learned
• A single MFCC frame is too short a unit for representing speech
• Mismatch in the speech signal matters a lot
  – Synthesized speech vs. extracted speech
• Lots of false alarms occur for short queries
  – At vs. hat vs. bat
Threshold
• How similar must two segments be before we decide they are the same word? (Detected or not)
• How many abstract representation units should we use to represent an unknown language?
  – Possibly can be handled with regularization
Representation
• We need to find a better representation (other than MFCC frames) for clustering
  – Phones work; an appropriate representation should work too, and is expected to come from a data-driven approach
• Advanced approaches for representation
  – Lee, Glass
  – Jenson, Church
  – SSS + clustering
Spoken Term Detection Experiments
• Dataset
  – NIST Spoken Term Detection 2006 Evaluation set
  – Advantage:
    • The dataset is designed for the STD task
• Evaluation metrics
  – ATWV
  – Advantage:
    • Evaluation tool is available
    • Can compare with lots of supervised baselines
Summary
• Clustering on MFCC frames gives an inappropriate representation for speech
• Need a better representation of speech units
• Channel/speaker mismatch harms performance a lot
• The extracted spoken queries and audio for the English CTS data are available.
Personal Belief in Zero Resource STD
[Figure labels: Speaker Dependent, Speaker Independent]
Special Thanks
• Alex Rudnicky
• Florian Metze
• Alan Black
• Rita Singh
• Jack Mostow