HIWIRE Progress Report – July 2006
Technical University of Crete
Speech Processing and Dialog Systems Group
Presenter: Alex Potamianos
Outline

Work package 1
• Task 1: Blind Source Separation for ASR
• Tasks 2, 5: Feature extraction and fusion
• Task 4: Segment models for ASR
Work package 2
• Tasks 1, 2: VTLN
• Task 2: Bayes optimal adaptation
Work package 3
• Task 1: Fixed platform integration
Blind Speech Separation (BSS) Problem

Data Model – Problem Statement

x(t) = Σ_{τ=0}^{L} A(τ) s(t − τ) + n(t)

• A(τ): mixing impulse response matrix
• a_i(τ): spatial signature of the i-th speaker for lag τ
• n(t): additive noise vector
• L: channel order

Objective: estimate the inverse-channel impulse response matrix W(τ) from the observed signal x(t)
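The data model above can be sketched numerically; a minimal simulation of the convolutive mixture (all array sizes and signals are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 2 speakers, 3 mics, channel order L = 8
n_src, n_mic, L, T = 2, 3, 8, 1000

s = rng.standard_normal((n_src, T))          # source signals s(t)
A = rng.standard_normal((L, n_mic, n_src))   # mixing impulse responses A(tau)
n = 0.01 * rng.standard_normal((n_mic, T))   # additive noise n(t)

# Convolutive mixing model: x(t) = sum_tau A(tau) s(t - tau) + n(t)
x = n.copy()
for tau in range(L):
    x[:, tau:] += A[tau] @ s[:, :T - tau]

print(x.shape)  # (3, 1000)
```

BSS then amounts to estimating W(τ) such that applying it to x recovers the sources, up to scaling and permutation.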
BSS Permutation Problem

Permutation problem: the "order" of the separated sources may differ in the solution for each frequency bin.
To solve the permutation problem, combine:
• Spatial constraints
• Continuity constraints in the frequency domain
The solution to the permutation problem can be formulated using an ILS minimization criterion.
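The continuity constraint can be illustrated with a small sketch: a greedy alignment that, for each frequency bin, picks the source ordering closest to the neighbouring bin (the function name and the distance-based cost are illustrative choices, not the slides' exact ILS formulation):

```python
import numpy as np
from itertools import permutations

def align_permutations(W):
    """Resolve the per-bin permutation ambiguity of frequency-domain BSS.

    W: array (n_bins, n_src, n_mic) of separation filters, one per bin.
    Continuity criterion (illustrative): for each bin, choose the source
    ordering whose filter rows are closest to the previous bin's rows.
    """
    n_bins, n_src, _ = W.shape
    W_out = W.copy()
    for f in range(1, n_bins):
        best, best_cost = None, np.inf
        for perm in permutations(range(n_src)):
            cost = np.linalg.norm(W_out[f - 1] - W_out[f][list(perm)])
            if cost < best_cost:
                best, best_cost = perm, cost
        W_out[f] = W_out[f][list(best)]
    return W_out
```

In the actual method, spatial constraints (e.g. direction-of-arrival estimates) would be combined with such a continuity cost.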
Recent Progress

Improved solution to the permutation problem:
• Combining spatial and continuity constraints
• Trying out different continuity criteria
Created a synthetic database using typical room impulse responses.
First ASR experiments using the "synthetic" database.
Outline

Work package 1
• Task 1: Blind Source Separation for ASR
• Tasks 2, 5: Feature extraction and fusion
• Task 4: Segment models for ASR
Work package 2
• Tasks 1, 2: VTLN
• Task 2: Bayes optimal adaptation
Work package 3
• Task 1: Fixed platform integration
Motivation

Combining classifiers/information sources is an important problem in machine learning applications.
A simple yet powerful way to combine classifiers is the "multi-stream" approach, which assumes independent information sources.
Unsupervised stream-weight computation for multi-stream classifiers is an open problem.
Problem Definition

Compute "optimal" exponent weights for each stream s
[HMM Gaussian-mixture formulation; similar expressions hold for MM, naïve Bayes, and Euclidean/Mahalanobis classifiers]
Optimality in the sense of minimizing the "total classification error"
Optimal Stream Weights: Result I

Given equal error rates in the single-stream classifiers, the optimal stream weights are inversely proportional to the total stream estimation-error variance.

Optimal Stream Weights: Result II

Given equal estimation-error variance in each stream, the optimal weights are approximately inversely proportional to the single-stream classification error.
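Result I can be sketched as follows (the function names and the example variances are hypothetical); the weights are normalized to sum to one:

```python
import numpy as np

def stream_weights_from_variances(sigma2):
    """Result I (sketch): with equal single-stream error rates, the optimal
    stream weights are inversely proportional to each stream's total
    estimation-error variance."""
    w = 1.0 / np.asarray(sigma2, dtype=float)
    return w / w.sum()

def multi_stream_score(stream_loglikes, weights):
    """Multi-stream combination: exponent weights act as a weighted sum of
    per-stream log-likelihoods (a product of exponent-weighted likelihoods)."""
    return float(np.dot(weights, stream_loglikes))

# Hypothetical audio/visual streams: the lower-variance stream gets more weight
w = stream_weights_from_variances([0.5, 2.0])
print(w)  # [0.8 0.2]
```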
Recent Progress

Experiments with synthetic data:
• Gaussian-distribution classification problem
• Results show a good match with the theoretical results
Experimental verification for naïve Bayes classifiers:
• Utterance classification – an NLP application
First experiments with "unsupervised" estimates of the stream weights:
• "Intra-class" metrics computed on the observations
• AV-ASR application
Outline

Work package 1
• Task 1: Blind Source Separation for ASR
• Tasks 2, 5: Feature extraction and fusion
• Task 4: Segment models for ASR
Work package 2
• Tasks 1, 2: VTLN
• Task 2: Bayes optimal adaptation
Work package 3
• Task 1: Fixed platform integration
Dynamical System Segment Model

Based on the linear dynamical system

x_{k+1} = F x_k + w_k
y_k = H x_k + v_k

where x is the state, y the observation, u a control input, and w, v noise terms. The system parameters should guarantee:
• Identifiability, Controllability, Observability, Stability
We investigated more generalized parameter structures.
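A minimal simulation of this state-space model (the dimensions, F, H, and noise variances are illustrative assumptions, not the slides' estimated parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions: state n = 3, observation m = 2, segment length K = 50
n, m, K = 3, 2, 50
F = 0.9 * np.eye(n)                 # stable state transition (assumed example)
H = rng.standard_normal((m, n))     # observation matrix
Q, R = 0.1, 0.05                    # process / measurement noise variances

x = np.zeros((K, n))
y = np.zeros((K, m))
x[0] = rng.standard_normal(n)
for k in range(K - 1):
    # x_{k+1} = F x_k + w_k
    x[k + 1] = F @ x[k] + np.sqrt(Q) * rng.standard_normal(n)
for k in range(K):
    # y_k = H x_k + v_k
    y[k] = H @ x[k] + np.sqrt(R) * rng.standard_normal(m)
```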
The system's parameters have an identifiable canonical form:
• F: "ones" on the superdiagonal and "zeros" elsewhere, except rows r_i, which are filled with free parameters (i = 1, …, n)
• H: column dimension equal to that of F, filled with "zeros"; taking r_0 = 0, row i has a "one" in column r_{i−1} + 1
• P, R: filled with free parameters
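A sketch of this canonical structure (our reading of the construction; the index vector r is illustrative, and the free entries are simply left at zero to show the fixed pattern):

```python
import numpy as np

def canonical_FH(r):
    """Build the identifiable canonical (F, H) pair described above (sketch).

    r = [r_1, ..., r_n]: per output, the row of F reserved for free
    parameters. F has ones on the superdiagonal and zeros elsewhere, except
    the reserved rows; H is all zeros except, with r_0 = 0, row i has a one
    in (1-based) column r_{i-1} + 1.
    """
    dim = max(r)                                      # state dimension
    F = np.zeros((dim, dim))
    F[np.arange(dim - 1), np.arange(1, dim)] = 1.0    # superdiagonal ones
    for ri in r:
        F[ri - 1, :] = 0.0                            # row r_i: free parameters
    H = np.zeros((len(r), dim))
    r_prev = [0] + list(r[:-1])                       # r_0 = 0
    for i, rp in enumerate(r_prev):
        H[i, rp] = 1.0                                # one in column r_{i-1}+1
    return F, H
```

For example, `canonical_FH([2, 4])` gives a 4-dimensional state with two outputs, free parameters in rows 2 and 4 of F.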
We propose a novel element-wise estimation based on the EM algorithm for system identification.
Generalized forms of parameter structures
Application to Speech

Experiments on clean data from AURORA 2:
• 11 word models (one … nine, zero, oh)
• The number of segments in each model depends on the number of phones in the word model
• HTK used for feature extraction (14 MFCCs)
• Alignments obtained with HTK using HMMs
• 4000 training sentences; 600 isolated words for testing
Results

Fig. (a): classification performance (using 3 different initializations)
Fig. (b): the log-likelihood increases for the same runs
Conclusions & Future Work

Developed new forms of linear state-space models
Proposed a novel element-wise parameter estimation process
Performed training & classification on AURORA 2 based on speech segments and LDS
Results show a correlation between performance and initialization
In the future:
• Investigation of optimal initialization
• Feature-segment alignment (through dynamic programming)
• Investigation of the state-space dimension
Outline

Work package 1
• Task 1: Blind Source Separation for ASR
• Tasks 2, 5: Feature extraction and fusion
• Task 4: Segment models for ASR
Work package 2
• Tasks 1, 2: VTLN
• Task 2: Bayes optimal adaptation
Work package 3
• Task 1: Fixed platform integration
Vocal Tract Length Normalization

Linear and Non-Linear Frequency Warping
Multi-Parameter Frequency Warping
Warping and Spectral Bias Addition by ML Estimation
Linear and Non-Linear Warping: Analysis

An optimal warping factor a is computed (for each phoneme) so that the Euclidean spectral distance (MSE) between the warped spectrum g_a(Y) and the corresponding unwarped spectrum X is minimized. Optimization is achieved by full search:

MSE = (1/N) Σ_{i=1}^{N} ‖ X_i − g_a(Y_i) ‖²

â = argmin_a (1/N) Σ_{i=1}^{N} ‖ X_i − g_a(Y_i) ‖²

The mapped spectrum is warped according to this optimal warping factor.
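The full search can be sketched as follows, with a linear warp g_a implemented by resampling the spectrum (the interpolation-based warp and the candidate grid are illustrative assumptions):

```python
import numpy as np

def optimal_warp(X, Y, alphas):
    """Full search for the optimal warping factor (sketch):
    a_hat = argmin_a (1/N) sum_i || X_i - g_a(Y_i) ||^2,
    with g_a implemented by resampling each spectrum Y_i at a * omega.

    X, Y: (N, n_bins) reference and to-be-warped spectra.
    alphas: candidate warping factors to search over.
    """
    n_bins = X.shape[1]
    omega = np.arange(n_bins)
    best_a, best_mse = None, np.inf
    for a in alphas:
        # g_a(Y): sample Y at the linearly warped frequencies a * omega
        Yw = np.array([np.interp(a * omega, omega, y) for y in Y])
        mse = np.mean(np.sum((X - Yw) ** 2, axis=1))
        if mse < best_mse:
            best_a, best_mse = a, mse
    return best_a, best_mse
```

When X and Y come from the same speaker, the search should return a factor near 1.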
Linear and Non-Linear Warping

Frequency warping is implemented by re-sampling the spectral envelope at linearly and non-linearly spaced frequency indices, i.e.
1. Linear: g_a : ω̃ = a ω, ω = 1, …, N
2. Piece-Wise Non-Linear
3. Power: g_a : ω̃ = N (ω / N)^a
Multi-Parameter Frequency Warping

After the computation of the optimal warping factor, we explore alternative piecewise-linear frequency warping strategies:

Bi-parametric warping function (2 pts): different warping factors are evaluated for the low (F < 3 kHz) and high (F ≥ 3 kHz) frequencies.
Four-parametric warping function (4 pts): different warping factors are evaluated for the frequency ranges 0–1.5, 1.5–3, 3–4.5 and 4.5–8 kHz.
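The bi-parametric case can be sketched as a piecewise-linear frequency map (keeping the map continuous at the 3 kHz breakpoint is our assumption):

```python
import numpy as np

def biparametric_warp(f, a_low, a_high, f_break=3000.0):
    """Bi-parametric piecewise-linear warping (sketch): slope a_low below the
    3 kHz breakpoint and a_high above it, continuous at the breakpoint."""
    f = np.asarray(f, dtype=float)
    low = a_low * f
    high = a_low * f_break + a_high * (f - f_break)
    return np.where(f < f_break, low, high)
```

The four-parametric version would apply the same idea with three breakpoints at 1.5, 3 and 4.5 kHz.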
Reduction in MSE: Non-linear warping
Reduction in MSE: Multi-parametric warping
Reduction in MSE: Bias Removal and Multi-parametric warping
Ongoing Work

Implementation of "phone-dependent" warping in HTK
Implementation of multi-parametric warping and bias removal in HTK
Outline

Work package 1
• Task 1: Blind Source Separation for ASR
• Tasks 2, 5: Feature extraction and fusion
• Task 4: Segment models for ASR
Work package 2
• Tasks 1, 2: VTLN
• Task 2: Bayes optimal adaptation
Work package 3
• Task 1: Fixed platform integration
Optimal Bayes Adaptation

• The central problem is to determine p(θ_A | X_t, s_t)
• Using Bayes rule we have

  p(θ_A | X_t, s_t) ∝ p(X_t | θ_A, s_t) p(θ_A | s_t)

• A 2-step process:
  • Obtain the priors p(θ | s_t) from the SI models
  • Compute the likelihoods p(X_t | θ_{A,i}, s_t)
• The adapted observation density then follows as

  p(x_t | s_t, X) = ∫_{R^N} p(x_t | θ_A, s_t) p(θ_A | X, s_t) dθ_A
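The two steps can be illustrated on a discrete parameter grid, replacing the integral by a sum (the grid of candidate means, the priors, and the adaptation data are all hypothetical scalar examples):

```python
import numpy as np

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

thetas = np.array([-1.0, 0.0, 1.0])   # candidate mean parameters (assumed)
prior = np.array([0.2, 0.6, 0.2])     # p(theta | s) from the SI models
X = np.array([0.9, 1.1, 1.0])         # adaptation data for this state

# Step 2: likelihood of the adaptation data under each candidate parameter
lik = np.array([np.prod(gauss(X, t, 1.0)) for t in thetas])

# Bayes rule: posterior over parameters, p(theta | X, s)
posterior = lik * prior
posterior /= posterior.sum()

def predictive(x):
    """Adapted density: posterior-weighted sum replacing the integral."""
    return float(np.sum(posterior * gauss(x, thetas, 1.0)))
```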
[Figure: genone-based output distributions – mixture components 1…M per genone (genone 1, genone 2, …), over the cepstral-coefficient dimensions]
Phone-Based Clustering

• Cluster the output distributions based on a common central phone
• θ is each component of the above representation and plays the role of the prior parameter
Our Implementation

• Computation of the priors using occupancy counts:

  p(θ | s_t) = N_{θ,s_t} / Σ_θ N_{θ,s_t}

• Computation of the likelihoods using the Baum-Welch algorithm and ML
• After the computation of the posterior probabilities we apply smoothing. Such techniques are:
  • Flooring
  • Uniform
  • Delta
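A sketch of the prior computation and the three smoothing schemes (the count-ratio prior and all smoothing constants are illustrative assumptions, not the slides' values):

```python
import numpy as np

def smoothed_prior(counts, method="flooring", floor=1e-3, delta=0.5):
    """Occupancy-count priors p(theta | s_t) = N_theta / sum N, followed by
    one of the smoothing schemes named above (sketch)."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    if method == "flooring":
        p = np.maximum(p, floor)              # floor small probabilities
    elif method == "uniform":
        u = np.full_like(p, 1.0 / len(p))
        p = 0.5 * p + 0.5 * u                 # interpolate with uniform (0.5 assumed)
    elif method == "delta":
        p = (counts + delta) / (counts + delta).sum()  # add-delta smoothing
    return p / p.sum()                        # renormalize
```

Smoothing keeps zero-count parameters from receiving zero posterior mass during adaptation.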
Outline

Work package 1
• Task 1: Blind Source Separation for ASR
• Tasks 2, 5: Feature extraction and fusion
• Task 4: Segment models for ASR
Work package 2
• Tasks 1, 2: VTLN
• Task 2: Bayes optimal adaptation
Work package 3
• Task 1: Fixed platform integration