http://fvalente.zxq.net/presentations/DataDrivenFeatures.pdf
Data-Driven Discriminative Speech Analysis Module in DARPA GALE
Fabio Valente and Hynek Hermansky
IDIAP Research Institute, Martigny, Switzerland
Motivation
ASR requires knowledge, and knowledge comes from data
– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, …)
– task-specific knowledge (e.g. language and its phonotactics, environment, …)
[Diagram: data-driven features derived from English → classifier, trained on small amounts of task-specific data]
Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006 (and also this meeting)
PROBLEM
For some tasks, amounts of data may be limited
ONE SOLUTION
Acquire speech-specific knowledge from large amounts of American English data
TANDEM and its training
TANDEM (Hermansky, Ellis and Sharma, ICASSP 2000)
[Diagram: evidence → TANDEM → transformed phoneme posteriors; TANDEM is trained on its own training data]
Training data for TANDEM can come from another application domain
[Plot: word error rate vs. amount of task-specific training data used for training the HMM models; WER on OGI digit data (Sivadas and Hermansky, ICASSP 2002); TANDEM trained on OGI stories, compared with PLP; labels in figure: 100%, 20%, 70%, 0%]
• Preprocessing of input data for TANDEM is beneficial
– e.g. TRAP technique (nonlinear and data-driven)
[Diagram: data (time-frequency plane) → linear processing → evidence (anything that carries the relevant information) → TANDEM (trained NN) → posteriogram (/f/, /ay/, /v/) → some function of phoneme posteriors → features for HMM]
The Current Research
• Where is the information?
• Study linear preprocessing using LDA
– data-driven technique
– straightforward interpretation in terms of basis functions
[Figure: time-frequency plane with three kinds of projections: spectral projections, FIR RASTA filters, 2-D projections]
Applied earlier to the American English portion of OGI Stories (about 3 hours of telephone-quality monologues from 210 adult talkers).
Hypothesis: if speech-specific information is being extracted, the general conclusions should hold for a different database (30 hours of RM and Switchboard from SRI).
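To make the "basis functions" reading of LDA concrete, here is a minimal NumPy sketch of how discriminant vectors could be estimated from phoneme-labeled vectors. The function name `lda_basis`, the small ridge term, and the whitening-based solver are assumptions made for illustration; the slides do not say how the LDA was actually computed.

```python
import numpy as np

def lda_basis(X, y, n_components, ridge=1e-6):
    """Estimate LDA discriminants from rows of X (n x d) with class labels y.

    Returns (eigenvalues, basis); columns of `basis` are the discriminants,
    ordered by decreasing eigenvalue. `ridge` is an assumed regularizer.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += ridge * np.trace(Sw) / d * np.eye(d)
    # Whiten Sw, then solve an ordinary symmetric eigenproblem for Sb.
    w, V = np.linalg.eigh(Sw)
    W = V / np.sqrt(w)  # W.T @ Sw @ W is the identity
    evals, U = np.linalg.eigh(W.T @ Sb @ W)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], W @ U[:, order]
```

Each returned column can be plotted against frequency (for spectral vectors) or time (for temporal vectors), which is what makes the result interpretable as a set of basis functions.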
Spectral Projections
Spectral sensitivity of projections
• Perturbation analysis
– project a Gaussian shape (σ = 250 Hz) onto the first 16 spectral basis functions and evaluate the effect of a 30 Hz shift in µ as a function of µ
– consistent with the auditory (Bark) spectral scale
[Plots: log-spectral Euclidean distance due to the shift in µ, as a function of µ; panels: shift in µ constant on the Hz scale, shift in µ constant on the Bark scale]
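The perturbation analysis can be sketched as below. This is only an illustration under stated assumptions: the learned spectral basis is not available here, so a cosine basis (`standin_basis`) is used as a hypothetical stand-in, and the frequency axis is assumed to span 0 to 4000 Hz in 129 points. The function name `shift_sensitivity` is also made up for this sketch.

```python
import numpy as np

def shift_sensitivity(basis, freqs, mus, sigma=250.0, shift=30.0):
    """Euclidean distance between the projections of a Gaussian shape
    centred at mu and at mu + shift, for each mu in `mus`."""
    def gauss(mu):
        return np.exp(-0.5 * ((freqs - mu) / sigma) ** 2)

    out = []
    for mu in mus:
        a = basis.T @ gauss(mu)          # projection before the shift
        b = basis.T @ gauss(mu + shift)  # projection after the shift
        out.append(np.linalg.norm(a - b))
    return np.array(out)

# Assumed frequency axis: 129 points from 0 to 4000 Hz.
freqs = np.linspace(0.0, 4000.0, 129)
# Hypothetical stand-in for the first 16 LDA spectral basis functions.
standin_basis = np.cos(np.pi * np.outer(np.arange(129) + 0.5, np.arange(16)) / 129)
mus = np.arange(500.0, 3501.0, 250.0)
sens = shift_sensitivity(standin_basis, freqs, mus)
```

With the learned basis in place of the stand-in, plotting `sens` against `mus` would give the kind of sensitivity curve the slide describes; the uniform cosine stand-in is not expected to reproduce the Bark-like behaviour.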
Relative importance of spectral regions
• Hilbert envelope of the first 15 spectral basis functions, averaged [plot against frequency in Hz]
• Importance of each frequency region for articulation and intelligibility [Fletcher 1953] [plot against frequency in Hz, logarithmic axis from 100 to 10000]
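The averaged Hilbert envelope can be computed roughly as follows. This is a NumPy-only sketch using the standard FFT construction of the analytic signal; the helper names `hilbert_envelope` and `mean_envelope`, and the layout of the basis as one column per basis function, are assumptions.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal of a real 1-D sequence (FFT method)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0        # double positive frequencies
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * h))

def mean_envelope(basis, n_first=15):
    """Average the envelopes of the first `n_first` columns of `basis`."""
    cols = basis[:, :n_first]
    envs = [hilbert_envelope(cols[:, j]) for j in range(cols.shape[1])]
    return np.mean(envs, axis=0)
```

The result has one value per frequency sample, so it can be compared directly with a frequency-importance curve such as the one attributed to Fletcher on the slide.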
Central 600 ms of temporal discriminants (impulse responses of FIR RASTA filters)
[Plot: amplitude of impulse response vs. time [ms]]
2-D discriminants
[Figure: 5 x 5 grid of 2-D discriminants; each panel spans time -200 to 200 ms and frequency 0 to 4000 Hz; panel labels:]
4.2%  3.7%  3.3%  3.2%  3.0%
2.8%  2.8%  2.7%  2.1%  2.0%
1.7%  1.6%  1.5%  1.4%  1.4%
1.2%  1.1%  1.1%  1.1%  1.1%
1.1%  1.1%  1.1%  1.1%  1.1%
Multi-RASTA
[Figure: temporal basis functions over -500 to 500 ms; time-frequency diagram with labels: average, frequency derivative, 3 critical bands; axes: time, frequency; one example shown (out of 32 possible)]
                     matched training and test    mismatched channel
conventional (PLP)   5.2 %                        13.5 %
Multi-RASTA          3.7 %                        3.8 %
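A bank of zero-mean temporal filters with variable temporal resolution, in the spirit of multi-RASTA, might look like the sketch below. The slides do not give the filter shapes; the use of first and second derivatives of Gaussians, the particular widths in `sigmas_ms`, and the function names are all assumptions made for illustration. Only the 10 ms step and the -500 to 500 ms span are taken from the slides.

```python
import numpy as np

def multi_rasta_bank(sigmas_ms=(8, 16, 32, 64, 130), half_len_ms=500, step_ms=10):
    """Build FIR filters as 1st and 2nd derivatives of Gaussians (assumed shapes)."""
    t = np.arange(-half_len_ms, half_len_ms + step_ms, step_ms, dtype=float)
    bank = []
    for s in sigmas_ms:
        g = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s**2 * g                       # first derivative of Gaussian
        d2 = (t**2 / s**4 - 1.0 / s**2) * g      # second derivative of Gaussian
        for f in (d1, d2):
            f = f - f.mean()                     # enforce zero mean exactly
            bank.append(f / np.linalg.norm(f))   # unit-energy normalization
    return t, np.array(bank)

def apply_bank(band_trajectory, bank):
    """Correlate one critical-band log-energy trajectory with every filter."""
    return np.array([np.convolve(band_trajectory, f[::-1], mode="same")
                     for f in bank])
```

Because every filter sums to zero, a constant offset in a band trajectory (for example a fixed channel gain in the log domain) produces no output away from the edges, which is one plausible reading of why zero-mean filters help under channel mismatch.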
Conclusions
• Data-driven (ANN-based TANDEM) feature extraction module as a means of implementing speech-specific (task-independent) knowledge
– aim: reduce the need for large amounts of task-specific acoustic training data
• LDA-guided pre-processing
– results qualitatively consistent across different databases (OGI Stories, and forced-aligned SRI Broadcast News and Switchboard data)
– optimality of a Bark-like frequency scale
– need for a larger (about 500 ms) temporal context in feature extraction
– dominant time-frequency discriminants as outer products of spectral discriminants and temporal discriminants
• Linear pre-processing of data for TANDEM
– multi-RASTA (projections on a zero-mean, variable-temporal-resolution basis)
– demonstrated improvements on a small-vocabulary task
2-D Linear Discriminant Analysis (Ye, Janardan and Li, NIPS 2005)
Initialize temporal basis R
Project spectro-temporal matrix on R
Estimate spectral basis L by LDA
Project spectro-temporal matrix on L
Estimate temporal basis R by LDA
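The alternating steps listed above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the array layout (`A` holding one time x frequency matrix per example), the identity-based initialization of R, the number of iterations `n_iter`, the ridge term, and all function names are assumptions.

```python
import numpy as np

def _top_discriminants(Sw, Sb, k, ridge=1e-6):
    """Top-k eigenvectors of inv(Sw) @ Sb via whitening (ridge is assumed)."""
    d = Sw.shape[0]
    Sw = Sw + ridge * np.trace(Sw) / d * np.eye(d)
    w, V = np.linalg.eigh(Sw)
    W = V / np.sqrt(w)
    evals, U = np.linalg.eigh(W.T @ Sb @ W)
    order = np.argsort(evals)[::-1][:k]
    return W @ U[:, order]

def _scatters(Z, y, P):
    """Within/between-class scatter of matrices Z[i] (a x b) projected by P (b x p)."""
    mean_all = Z.mean(axis=0)
    a = Z.shape[1]
    Sw = np.zeros((a, a))
    Sb = np.zeros((a, a))
    for c in np.unique(y):
        Zc = Z[y == c]
        Mc = Zc.mean(axis=0)
        D = (Zc - Mc) @ P                       # shape (n_c, a, p)
        Sw += np.einsum("nap,nbp->ab", D, D)
        E = (Mc - mean_all) @ P                 # shape (a, p)
        Sb += len(Zc) * (E @ E.T)
    return Sw, Sb

def two_d_lda(A, y, n_spec, n_temp, n_iter=3):
    """A: (n, T, F) spectro-temporal matrices. Returns L (F x n_spec), R (T x n_temp)."""
    n, T, F = A.shape
    R = np.eye(T)[:, :n_temp]                   # initialize temporal basis R
    At = np.transpose(A, (0, 2, 1))             # (n, F, T)
    for _ in range(n_iter):
        Sw, Sb = _scatters(At, y, R)            # project on R
        L = _top_discriminants(Sw, Sb, n_spec)  # estimate spectral basis L
        Sw, Sb = _scatters(A, y, L)             # project on L
        R = _top_discriminants(Sw, Sb, n_temp)  # estimate temporal basis R
    return L, R
```

A 2-D discriminant is then the outer product of one column of R with one column of L, which matches the conclusion that the dominant time-frequency discriminants factor into spectral and temporal parts.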
Eigenvalues
[Plots: eigenvalues for spectral discriminants, temporal discriminants, and 2-D discriminants]
Temporal discriminants across critical bands
[Plots: first discriminant, second discriminant, third discriminant]
• Preprocessing of input data for TANDEM is beneficial
– TRAP technique (nonlinear and data-driven)
– multi-RASTA filters (linear and "knowledge" guided)
[Diagram: data (time-frequency plane) → linear processing → evidence (anything that carries the relevant information) → TANDEM (trained NN) → posteriogram (/f/, /ay/, /v/) → some function of phoneme posteriors → features for HMM]
Temporal discriminants (frequency responses of FIR RASTA filters)
[Plot: log magnitude spectrum [dB] vs. modulation frequency [Hz]]
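Viewing a temporal discriminant as an FIR filter, its response over modulation frequency can be obtained with an FFT of the impulse response. The sketch below assumes the 10 ms frame step used elsewhere in the slides (a 100 Hz frame rate, so modulation frequencies up to 50 Hz); the function name and the FFT length are illustrative choices.

```python
import numpy as np

def modulation_response_db(h, step_ms=10.0, n_fft=1024):
    """Log-magnitude response (dB) of FIR filter `h` over modulation frequency (Hz)."""
    frame_rate = 1000.0 / step_ms                 # 100 Hz for a 10 ms step
    H = np.fft.rfft(h, n=n_fft)                   # zero-padded spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate)
    mag_db = 20.0 * np.log10(np.abs(H) + 1e-12)   # small floor avoids log(0)
    return freqs, mag_db
```

For a zero-mean filter the response at 0 Hz is strongly suppressed, which is the band-pass (RASTA-like) behaviour the slide title refers to.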
Experimental Setup
• 30 hours of phoneme-labeled (forced alignment) data from SRI (Switchboard, broadcast news,…)
• spectral vectors: 129 samples of LPC log power spectrum (12th order, 30 ms window, 10 ms step)
• temporal vectors: 2010 ms long (201 samples), labeled by the phoneme in the center, 10 ms step
• spectro-temporal matrix: 201 time samples x 129 spectral samples, 10 ms step
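The spectro-temporal matrices described above could be cut from a sequence of spectral vectors as in this sketch. It assumes the log spectra are already computed as an array of shape (frames, 129) with one phoneme label per frame; the function name `context_windows` and the choice to skip frames too close to the edges are assumptions.

```python
import numpy as np

def context_windows(log_spec, labels, half=100):
    """Cut (2*half + 1)-frame windows centred on each frame with full context.

    log_spec: (n_frames, n_bins) array of spectral vectors, 10 ms step.
    labels:   one label per frame; each window takes the label of its centre.
    Returns (windows, centre_labels), windows of shape (n_valid, 2*half+1, n_bins).
    """
    n_frames = log_spec.shape[0]
    centers = np.arange(half, n_frames - half)
    idx = centers[:, None] + np.arange(-half, half + 1)[None, :]
    return log_spec[idx], np.asarray(labels)[centers]
```

With `half=100`, each window is 201 time samples by 129 spectral samples; a single column of a window is one temporal vector, and a single row is one spectral vector.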
Motivation
ASR requires knowledge, and knowledge comes from data
– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, …)
– task-specific knowledge (e.g. language and its phonotactics, environment, …)
CONVENTIONAL WAY
[Diagram: features → classifier trained on English, then adapted on small amounts of task-specific data]
ALTERNATIVE
[Diagram: data-driven features derived from English → classifier, trained on small amounts of task-specific data]
Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006 (and also this meeting)
PROBLEM
For some tasks, amounts of data may be limited
ONE SOLUTION
Acquire speech-specific knowledge from large amounts of American English data
Frequencies around 600 Hz are the most important for decoding of nonsense syllables