HIWIRE Progress Report – July 2006 Technical University of Crete Speech Processing and Dialog Systems Group Presenter: Alex Potamianos


HIWIRE Progress Report – July 2006

Technical University of Crete
Speech Processing and Dialog Systems Group

Presenter: Alex Potamianos

Outline

Work package 1
• Task 1: Blind Source Separation for ASR
• Task 2,5: Feature extraction and fusion
• Task 4: Segment models for ASR

Work package 2
• Task 1,2: VTLN
• Task 2: Bayes optimal adaptation

Work package 3
• Task 1: Fixed platform integration


Blind Speech Separation (BSS) problem

Data Model – Problem Statement

Quantities in the data model:
• mixing impulse response matrix
• spatial signature of the i-th speaker for lag τ
• additive noise vector
• L: channel order

Objective: estimate the inverse-channel impulse response matrix W(τ) from the observed signal
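A standard convolutive mixing model that is consistent with the quantities listed on this slide is sketched below. Only W(τ) and L appear in the slide text; the symbols x(t), s(t), H(τ), h_i(τ), n(t) and the speaker count N are assumed notation, so this should be read as an illustration rather than the slide's own equation.

```latex
\mathbf{x}(t) = \sum_{\tau=0}^{L} \mathbf{H}(\tau)\,\mathbf{s}(t-\tau) + \mathbf{n}(t),
\qquad
\mathbf{H}(\tau) = \bigl[\,\mathbf{h}_1(\tau)\;\cdots\;\mathbf{h}_N(\tau)\,\bigr]

\hat{\mathbf{s}}(t) = \sum_{\tau} \mathbf{W}(\tau)\,\mathbf{x}(t-\tau)
```

Here h_i(τ) plays the role of the spatial signature of the i-th speaker at lag τ, and the second line shows how an estimated W(τ) would be applied to the observed signal.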

BSS permutation problem

Permutation problem: the “order” of mics may be different in the solution for each frequency bin

To solve the permutation problem, combine
• Spatial constraints
• Continuity constraints in frequency domain

Solution to the permutation problem can be formulated using
• ILS minimization criterion
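A minimal sketch of the continuity idea only, under stated assumptions: it aligns the output order of each frequency bin with the previous bin by maximizing the correlation of magnitude envelopes. It does not implement the spatial constraints or the ILS criterion named on the slide, and the function name `align_permutations` and the synthetic data are hypothetical.

```python
import itertools
import numpy as np


def align_permutations(envelopes):
    """Align output order across frequency bins using envelope continuity.

    envelopes: array of shape (n_bins, n_sources, n_frames) with the magnitude
    envelope of each separated output in each frequency bin.
    Returns one permutation per bin so that envelopes[f, perms[f]] is
    consistently ordered across bins.
    """
    n_bins, n_sources, _ = envelopes.shape
    identity = tuple(range(n_sources))
    perms = [identity]
    reference = envelopes[0]
    for f in range(1, n_bins):
        best_perm, best_score = identity, -np.inf
        for perm in itertools.permutations(range(n_sources)):
            # Sum of correlations between reference outputs and permuted outputs.
            score = sum(
                np.corrcoef(reference[i], envelopes[f, perm[i]])[0, 1]
                for i in range(n_sources)
            )
            if score > best_score:
                best_perm, best_score = perm, score
        perms.append(best_perm)
        # The aligned bin becomes the reference for the next bin.
        reference = envelopes[f, list(best_perm)]
    return perms


# Synthetic example (assumption): two sources whose order is swapped in odd bins.
rng = np.random.default_rng(0)
sources = rng.random((2, 200))
bins = []
for f in range(6):
    order = [0, 1] if f % 2 == 0 else [1, 0]
    bins.append(sources[order] + 0.01 * rng.random((2, 200)))
perms = align_permutations(np.stack(bins))
```

The exhaustive search over permutations is only practical for a small number of sources, which is the usual case for a few microphones.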

Recent progress

Improved solution to the permutation problem
• Combining spatial and continuity constraints
• Trying out different continuity criteria

Created a synthetic database using typical room impulse responses

First ASR experiments using the “synthetic” database

Work package 1 • Task 2,5: Feature extraction and fusion

Motivation

Combining classifiers/information sources is an important problem in machine learning applications.

A simple, yet powerful, way to combine classifiers is the “multi-stream” approach, which assumes independent information sources.

Unsupervised stream weight computation for multi-stream classifiers is an open problem.

Problem Definition

Compute “optimal” exponent weights for each stream s

[HMM Gaussian mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifier]

Optimality in the sense of minimizing “total classification error”
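As an illustration of exponent weights, the sketch below combines per-stream class likelihoods by raising each to its stream weight, which becomes a weighted sum in the log domain. The function names and the likelihood values are hypothetical, and the per-stream models themselves (for example Gaussian mixtures) are assumed to be given.

```python
import numpy as np


def multistream_log_score(stream_log_likelihoods, weights):
    """Weighted sum of per-stream log-likelihoods.

    stream_log_likelihoods: shape (n_streams, n_classes), log p_s(x_s | class).
    weights: shape (n_streams,), the exponent weight of each stream.
    Returns the combined log score per class.
    """
    ll = np.asarray(stream_log_likelihoods, dtype=float)
    w = np.asarray(weights, dtype=float)
    return w @ ll


def classify(stream_log_likelihoods, weights):
    """Pick the class with the highest combined score."""
    return int(np.argmax(multistream_log_score(stream_log_likelihoods, weights)))


# Two streams, two classes; the likelihood values are made up for illustration.
log_lik = np.log([[0.6, 0.4],
                  [0.2, 0.8]])
only_first = classify(log_lik, [1.0, 0.0])
only_second = classify(log_lik, [0.0, 1.0])
equal = classify(log_lik, [0.5, 0.5])
```

With all weight on one stream the decision follows that stream alone; intermediate weights trade the two streams off against each other.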

Optimal Stream Weights: Result I

If the single-stream classifiers have equal error rates, the optimal stream weights are inversely proportional to the total stream estimation error variance.

Optimal Stream Weights: Result II

If the estimation error variance is equal in each stream, the optimal weights are approximately inversely proportional to the single-stream classification error.
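Both results describe weights that are inversely proportional to a per-stream quantity (error variance in Result I, classification error in Result II). A small sketch of that proportionality follows; normalizing the weights to sum to one is an assumption, as are the function name and the numbers.

```python
import numpy as np


def inverse_proportional_weights(values):
    """Return weights proportional to 1 / value, normalized to sum to 1.

    values: per-stream estimation error variances (Result I) or
    per-stream classification error rates (Result II).
    """
    inv = 1.0 / np.asarray(values, dtype=float)
    return inv / inv.sum()


# Result I: hypothetical error variances of two streams.
w_variance = inverse_proportional_weights([1.0, 4.0])
# Result II: hypothetical single-stream error rates.
w_error = inverse_proportional_weights([0.1, 0.3])
```

In both cases the more reliable stream (lower variance or lower error) receives the larger weight.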

Recent Progress

Experiments with synthetic data
• Gaussian distribution classification problem
• Results show good match with theoretical results

Experimental verification for Naïve Bayes classifiers
• Utterance classification (NLP application)

First experiments with “unsupervised” estimates of stream weights
• “Intra-class” based metrics on observations
• AV-ASR application

Work package 1 • Task 4: Segment models for ASR

Dynamical System Segment Model

Based on a linear dynamical system

\[
x_{k+1} = F x_k + w_k, \qquad y_k = H x_k + v_k
\]

where x is the state, y the observation, u the control, and w, v are noise.

The system parameters should guarantee
• Identifiability, Controllability, Observability, Stability

We investigated more generalized parameter structures
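To make the state-space recursion concrete, here is a minimal simulation sketch. It assumes zero-mean Gaussian noise with covariances P (state noise w) and R (observation noise v); that reading of P and R, the function name, and the example matrices are assumptions, not taken from the slides.

```python
import numpy as np


def simulate_lds(F, H, P, R, x0, n_steps, rng):
    """Simulate x_{k+1} = F x_k + w_k, y_k = H x_k + v_k.

    w_k ~ N(0, P) and v_k ~ N(0, R) are assumed Gaussian.
    Returns the state sequence (n_steps, n) and observations (n_steps, m).
    """
    n, m = F.shape[0], H.shape[0]
    states = np.zeros((n_steps, n))
    observations = np.zeros((n_steps, m))
    x = np.asarray(x0, dtype=float)
    for k in range(n_steps):
        states[k] = x
        observations[k] = H @ x + rng.multivariate_normal(np.zeros(m), R)
        x = F @ x + rng.multivariate_normal(np.zeros(n), P)
    return states, observations


# Hypothetical stable 2-state system with a scalar observation.
F = np.array([[0.0, 1.0],
              [-0.5, 0.9]])
H = np.array([[1.0, 0.0]])
P = 0.01 * np.eye(2)
R = 0.1 * np.eye(1)
x0 = np.array([1.0, 0.0])
states, observations = simulate_lds(F, H, P, R, x0, 50, np.random.default_rng(0))
```

In a segment model, each speech segment would have its own parameter set, and a feature sequence would play the role of the observations.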

Generalized forms of parameter structures

The system’s parameters have an identifiable canonical form
• F: “ones” on the superdiagonal and “zeros” elsewhere, except rows r_i (i = 1, …, n), which hold free parameters
• H: same column dimension as F, filled with “zeros”; taking r_0 = 0, row i has a “one” in column r_{i-1} + 1
• P, R: filled with free parameters

We propose a novel element-wise estimation method based on the EM algorithm for system identification.
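The canonical structure described above can be sketched as matrix templates. This is one reading of the slide, under the assumption that each row r_i of F is entirely free (including its superdiagonal entry) and that the indices r_i are 1-based and increasing; the function name and arguments are hypothetical.

```python
import numpy as np


def canonical_structure(row_indices, state_dim):
    """Build templates for the canonical form of F and H.

    row_indices: 1-based rows r_1 < r_2 < ... of F that hold free parameters.
    state_dim: number of states (F is state_dim x state_dim).
    Returns (F, free_F, H): F with its fixed entries, a boolean mask of the
    free entries of F, and the fixed selection matrix H.
    """
    F = np.zeros((state_dim, state_dim))
    free_F = np.zeros((state_dim, state_dim), dtype=bool)
    for i in range(state_dim - 1):
        F[i, i + 1] = 1.0              # ones on the superdiagonal
    for r in row_indices:
        F[r - 1, :] = 0.0              # row r_i carries only free parameters
        free_F[r - 1, :] = True
    H = np.zeros((len(row_indices), state_dim))
    previous = 0                       # r_0 = 0
    for i, r in enumerate(row_indices):
        H[i, previous] = 1.0           # 1-based column r_{i-1} + 1
        previous = r
    return F, free_F, H


F_fixed, free_mask, H_fixed = canonical_structure([2, 3], 3)
```

An element-wise estimator would then update only the entries where the mask is true and leave the fixed zeros and ones untouched.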

Application on speech

• Experiments on clean data from AURORA 2
• 11 word-models (one…nine, zero, oh)
• The number of segments of each model depends on the number of phones of the word-model
• HTK for feature extraction (14 MFCCs)
• Alignments obtained from HTK using HMMs
• 4000 training sentences; 600 isolated words for testing


Results

Fig. (a) classification performance (using 3 different initializations)

Fig. (b) the log-likelihood is increasing for the same runs

Conclusions & Future Work

• Developed new forms of linear state-space models
• Proposed a novel element-wise parameter estimation process
• Performed training & classification on AURORA 2 based on speech segments and LDS
• Results showed a correlation between performance and initialization

In the future:
• Investigation of optimal initialization
• Feature-segments alignment (through dynamic programming)
• Investigation of state space dimension

Work package 2 • Task 1,2: VTLN

Vocal Tract Length Normalization.

Linear and Non-Linear Frequency Warping.

Multi-Parameter Frequency Warping.

Warping and Spectral Bias Addition by ML Estimation.

Linear and Non-Linear Warping: Analysis

An optimal warping factor a is computed (for each phoneme) so that the Euclidean spectral distance (MSE) between the warped spectrum g_a(Y) and the corresponding unwarped spectrum X is minimized. Optimization is achieved by full search.

The mapped spectrum is warped according to this optimal warping factor.

\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left| X_i - g_a(Y_i) \right|^2
\]

\[
\hat{a} = \arg\min_a \frac{1}{N} \sum_{i=1}^{N} \left| X_i - g_a(Y_i) \right|^2
\]
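The full search over warping factors can be sketched as follows. The example uses linear warping on a normalized frequency axis and a synthetic two-peak envelope; the function names, the candidate grid and the test envelope are assumptions made for illustration.

```python
import numpy as np


def warp_linear(spectrum, a):
    """Resample a spectral envelope at linearly warped indices a * w.

    The frequency axis is normalized to [0, 1]; np.interp holds the last
    value for indices that fall beyond the final bin.
    """
    w = np.linspace(0.0, 1.0, len(spectrum))
    return np.interp(a * w, w, spectrum)


def warping_mse(X, Y, a):
    """Mean over frames of the squared Euclidean distance |X_i - g_a(Y_i)|^2."""
    return float(np.mean([np.sum((x - warp_linear(y, a)) ** 2) for x, y in zip(X, Y)]))


def best_warping_factor(X, Y, candidates):
    """Full search: return the candidate factor with the smallest MSE."""
    errors = [warping_mse(X, Y, a) for a in candidates]
    return float(candidates[int(np.argmin(errors))])


def envelope(w):
    """Synthetic envelope with two formant-like peaks (illustrative only)."""
    return np.exp(-((w - 0.3) / 0.05) ** 2) + 0.7 * np.exp(-((w - 0.6) / 0.08) ** 2)


w = np.linspace(0.0, 1.0, 256)
X = np.array([envelope(w)])
Y = np.array([envelope(w / 0.9)])      # constructed so that Y(0.9 w) = X(w)
candidates = np.linspace(0.8, 1.2, 21)
a_hat = best_warping_factor(X, Y, candidates)
```

In the per-phoneme setting of the slide, X and Y would hold the frames of one phoneme from the two speakers, and the search would be repeated for each phoneme.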

Linear and Non-Linear Warping

Frequency warping is implemented by re-sampling the spectral envelope at linearly and nonlinearly warped frequency indices, i.e.

1. Linear: \( g_a: \tilde{\omega} = a\,\omega \)
2. Piece-Wise Non-Linear
3. Power: \( g_a: \tilde{\omega} = N\,(\omega / N)^{a} \)

Multi-Parameter Frequency Warping

After computing the optimal warping factor, we explore alternative piecewise linear frequency warping strategies.

Bi-Parametric Warping Function (2pts)
Different warping factors are evaluated for the low (F < 3 kHz) and high (F ≥ 3 kHz) frequencies.

Four-Parametric Warping Function (4pts)
Different warping factors are evaluated for the frequency ranges 0-1.5, 1.5-3, 3-4.5 and 4.5-8 kHz.
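One way to realize band-specific warping factors is a continuous piecewise-linear map whose slope in each band equals that band's factor. The slides do not say how the bands are joined, so the continuity at band edges, the function name and the example factors below are all assumptions.

```python
import numpy as np


def piecewise_warp(freqs_hz, band_edges_hz, factors):
    """Continuous piecewise-linear frequency warping.

    band_edges_hz: increasing band edges, e.g. [0, 3000, 8000].
    factors: one warping factor (slope) per band.
    Returns the warped frequency for each entry of freqs_hz.
    """
    edges = np.asarray(band_edges_hz, dtype=float)
    slopes = np.asarray(factors, dtype=float)
    # Warped position of each edge: running sum of slope * band width.
    warped_edges = np.concatenate(
        ([edges[0]], edges[0] + np.cumsum(slopes * np.diff(edges)))
    )
    return np.interp(freqs_hz, edges, warped_edges)


# Bi-parametric case: separate factors below and above 3 kHz.
two_band = piecewise_warp([1500.0, 3000.0, 5500.0, 8000.0], [0, 3000, 8000], [1.1, 0.9])
# Four-parametric case with all factors equal to 1 leaves frequencies unchanged.
four_band = piecewise_warp([1000.0, 4000.0], [0, 1500, 3000, 4500, 8000], [1.0, 1.0, 1.0, 1.0])
```

The warped frequencies can then be used as resampling positions for the spectral envelope, in the same way as the single-factor case.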


Reduction in MSE: Non-linear warping


Reduction in MSE: Multi-parametric warping


Reduction in MSE: Bias Removal and Multi-parametric warping

Ongoing work

Implementation of “phone-dependent” warping in HTK

Implementation of multi-parametric warping and bias removal in HTK

Work package 2 • Task 2: Bayes optimal adaptation

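Optimal Bayes Adaptation

• Central problem is to determine

\[
p(x_t \mid s_t, X_A) = \int_{\mathbb{R}^N} p(x_t \mid s_t, \theta)\, p(\theta \mid X_A, s_t)\, d\theta
\]

• Using Bayes rule we have

\[
p(\theta \mid X_A, s_t) \propto p(X_A \mid \theta, s_t)\, p(\theta \mid s_t)
\]

• 2-step process
  • Obtain the priors \( p(\theta \mid s_t) \) from the SI models
  • Compute the likelihoods \( p(X_A \mid \theta_i, s_t) \)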
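A toy sketch of the two-step process, assuming the integral over θ is replaced by a sum over a finite set of candidate parameters θ_i (here, candidate means of a 1-D Gaussian with known variance). The function names, the candidate set and the data are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def posterior_over_theta(adapt_data, means, var, prior):
    """p(theta_i | X_A, s) proportional to p(X_A | theta_i, s) p(theta_i | s)."""
    log_lik = np.array(
        [np.sum(np.log(gaussian_pdf(adapt_data, m, var))) for m in means]
    )
    log_post = log_lik + np.log(prior)
    log_post -= log_post.max()         # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()


def predictive(x, means, var, posterior):
    """p(x | s, X_A) = sum_i p(x | s, theta_i) p(theta_i | X_A, s)."""
    return float(np.sum(gaussian_pdf(x, np.asarray(means), var) * posterior))


means = np.array([-1.0, 0.0, 1.0])      # candidate parameters theta_i
prior = np.array([1.0, 1.0, 1.0]) / 3   # stands in for the SI-model prior
adapt_data = np.array([0.9, 1.1, 1.0])  # adaptation data X_A
post = posterior_over_theta(adapt_data, means, 1.0, prior)
near = predictive(1.0, means, 1.0, post)
far = predictive(-1.0, means, 1.0, post)
```

The adaptation data shift the posterior toward the candidate that explains them best, and the predictive density follows.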

Phone-Based Clustering

• Cluster the output distributions based on common central phone

[Figure: genone 1 and genone 2, each with mixture components 1, 2, …, M; axes: number of dimensions (cepstrum coefficients) and number of mixture components]

θ is every component of the above representation and stands for the prior

Our Implementation

• Computation of priors from the SI models
• Computation of likelihoods using the Baum-Welch algorithm and ML
• After computation of the posterior probabilities we apply smoothing. Such techniques are:
  • Flooring
  • Uniform
  • Delta
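The slides name the smoothing techniques without defining them. The sketch below shows one common interpretation of two of them, flooring and uniform smoothing, applied to a posterior vector; both definitions and the parameter values are assumptions, and the "Delta" technique is left out because its definition cannot be inferred from the slide.

```python
import numpy as np


def floor_smooth(posterior, floor):
    """Assumed flooring: clamp each probability to at least `floor`, then renormalize."""
    p = np.maximum(np.asarray(posterior, dtype=float), floor)
    return p / p.sum()


def uniform_smooth(posterior, alpha):
    """Assumed uniform smoothing: interpolate with a uniform distribution."""
    p = np.asarray(posterior, dtype=float)
    return (1.0 - alpha) * p + alpha / len(p)


floored = floor_smooth([0.98, 0.02, 0.0], 0.05)
mixed = uniform_smooth([1.0, 0.0, 0.0], 0.3)
```

Both variants keep every component's probability strictly positive, which avoids a single sharp posterior peak dominating when little adaptation data is available.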
