Data-Driven Discriminative Speech Analysis Module in DARPA GALE Fabio Valente and Hynek Hermansky IDIAP Research Institute, Martigny, Switzerland


DESCRIPTION

http://fvalente.zxq.net/presentations/DataDrivenFeatures.pdf


Page 1: DataDrivenFeatures

Data-Driven Discriminative Speech Analysis Module in DARPA GALE

Fabio Valente and Hynek Hermansky

IDIAP Research Institute, Martigny, Switzerland

Page 2: DataDrivenFeatures

Motivation

ASR requires knowledge and knowledge comes from data

– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, …)

– task-specific knowledge (e.g. language and its phonotactics, environment,…)

data-driven features derived from English → classifier trained on small amounts of task-specific data

Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006 (and also this meeting)

PROBLEM

For some tasks, amounts of data may be limited

ONE SOLUTION

Acquire speech-specific knowledge from large amounts of American English data

Page 3: DataDrivenFeatures

TANDEM and its training

TANDEM (Hermansky, Ellis and Sharma, ICASSP 2000)

[Diagram: evidence → TANDEM → transformed phoneme posteriors; TANDEM is estimated from training data]

Training data for TANDEM can come from another application domain.

[Plot: WER on OGI digit data (Sivadas and Hermansky, ICASSP 2002). Word error rate versus the amount of task-specific training data used to train the HMM models. Legend: PLP; TANDEM trained on OGI stories. Recovered tick labels: 100 %, 20 %, 70 %, 0 %.]

Page 4: DataDrivenFeatures

• preprocessing of input data for TANDEM is beneficial – e.g. TRAP technique (nonlinear and data-driven)

[Diagram: data (time-frequency plane) → linear processing → evidence → TANDEM (trained NN) → posteriogram (e.g. /f/ /ay/ /v/) → some function of phoneme posteriors → features for HMM]

evidence: anything that carries the relevant information
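The "some function of phoneme posteriors" step can be sketched as follows. This is a minimal illustration, assuming one common choice: a logarithm followed by a decorrelating PCA projection. The function name `tandem_features`, the `eps` floor and the random stand-in posteriors are hypothetical and are not taken from the slides.

```python
import numpy as np

def tandem_features(posteriors, n_keep, eps=1e-10):
    """Turn frame-level phoneme posteriors into decorrelated features.

    posteriors: array (n_frames, n_phonemes), each row sums to 1.
    Assumed transform: log, mean removal, PCA projection. This is one
    possible "function of phoneme posteriors", not necessarily the authors'.
    """
    logp = np.log(posteriors + eps)            # compress the skewed posteriors
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_keep]
    return centered @ eigvecs[:, order]

# Stand-in for the output of a trained NN: softmax of random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 5))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
features = tandem_features(posteriors, n_keep=3)
```

The resulting columns are mutually uncorrelated on the data they were estimated from, which suits HMMs with diagonal-covariance Gaussians.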

Page 5: DataDrivenFeatures

The Current Research

• Where is the information?
• Study linear preprocessing using LDA
– data-driven technique
– straightforward interpretation in terms of basis functions

[Diagram: time-frequency plane with three kinds of projections: spectral projections, FIR RASTA filters, 2-D projections]

Applied earlier to the American English portion of OGI stories (about 3 hours of telephone-quality monologues from 210 adult talkers).

Hypothesis: if the analysis extracts speech-specific information, the general conclusions should hold for a different database (30 hours of RM and Switchboard from SRI).
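The LDA step can be sketched as below: the discriminants are the leading eigenvectors of inv(Sw)·Sb, where Sw and Sb are the within-class and between-class scatter matrices, and each eigenvector can be read as a basis function. This is a minimal sketch on toy data; the function name `lda_basis`, the ridge term and the data are assumptions for illustration, not the authors' setup.

```python
import numpy as np

def lda_basis(X, y, n_basis, ridge=1e-6):
    """Return the leading LDA discriminants (columns) and their eigenvalues.

    X: (n_vectors, n_dims) data, e.g. one log spectrum per row.
    y: (n_vectors,) integer class labels, e.g. phoneme indices.
    """
    n_dims = X.shape[1]
    grand_mean = X.mean(axis=0)
    Sw = np.zeros((n_dims, n_dims))   # within-class scatter
    Sb = np.zeros((n_dims, n_dims))   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        Sw += (Xc - class_mean).T @ (Xc - class_mean)
        diff = (class_mean - grand_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Small ridge (an assumption) keeps Sw invertible for correlated spectra.
    Sw += ridge * np.trace(Sw) / n_dims * np.eye(n_dims)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)[:n_basis]
    return eigvecs[:, order].real, eigvals.real[order]

# Toy data: two classes that differ only along dimension 0.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 4))
X[y == 1, 0] += 5.0
basis, eigvals = lda_basis(X, y, n_basis=1)
```

On the toy data the single discriminant points almost entirely along dimension 0, the only dimension that separates the classes.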

Page 6: DataDrivenFeatures

Spectral Projections

Page 7: DataDrivenFeatures

Spectral sensitivity of projections

• Perturbation analysis
– project a Gaussian shape (σ = 250 Hz) onto the first 16 spectral bases and evaluate the effect of a 30 Hz shift in µ as a function of µ

Consistent with the auditory (Bark) spectral scale

[Plots: log-spectral Euclidean distance due to the shift in µ, as a function of µ. One panel for a shift in µ constant on the Hz scale, one for a shift in µ constant on the Bark scale.]
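To make the two shift conventions concrete, the sketch below converts between Hz and a Bark-like scale. It assumes the warping 6·asinh(f/600) used in PLP analysis and an illustrative set of µ values; it does not reproduce the perturbation analysis itself, which needs the LDA bases.

```python
import math

def hz_to_bark(f_hz):
    """Bark-like warping 6*asinh(f/600), the form used in PLP analysis."""
    return 6.0 * math.asinh(f_hz / 600.0)

def bark_to_hz(z):
    """Inverse of hz_to_bark."""
    return 600.0 * math.sinh(z / 6.0)

SHIFT_HZ = 30.0                                       # shift in µ from the slide
centers_hz = [250.0, 500.0, 1000.0, 2000.0, 3000.0]   # illustrative µ values

# A constant shift in Hz is a shrinking shift in Bark as µ grows.
shift_in_bark = [hz_to_bark(mu + SHIFT_HZ) - hz_to_bark(mu) for mu in centers_hz]

# Conversely, a constant shift in Bark is a growing shift in Hz.
SHIFT_BARK = shift_in_bark[0]
shift_in_hz = [bark_to_hz(hz_to_bark(mu) + SHIFT_BARK) - mu for mu in centers_hz]

for mu, d_bark, d_hz in zip(centers_hz, shift_in_bark, shift_in_hz):
    print(f"mu = {mu:6.0f} Hz: {SHIFT_HZ:.0f} Hz = {d_bark:.3f} Bark; "
          f"{SHIFT_BARK:.3f} Bark = {d_hz:6.1f} Hz")
```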

Page 8: DataDrivenFeatures

Relative importance of spectral regions

• Hilbert envelope of the first 15 spectral bases, averaged

[Plot: averaged envelope versus frequency [Hz]]

• Importance of each frequency region for articulation and intelligibility [Fletcher 1953]

[Plot: importance versus frequency [Hz], logarithmic axis with ticks at 100, 1000, 10000]
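The averaged Hilbert envelope can be sketched as follows. The envelope is the magnitude of the analytic signal, computed here with the FFT. The windowed-cosine `bases` are a hypothetical stand-in for the LDA spectral bases, which are not available in the slides; `hilbert_envelope` and `average_envelope` are illustrative names.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal of a real sequence, via the FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                        # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0               # keep Nyquist once
        weights[1:n // 2] = 2.0             # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))

def average_envelope(bases):
    """Average the envelopes of the columns of `bases` (n_freq, n_bases)."""
    return np.mean([hilbert_envelope(b) for b in bases.T], axis=0)

# Hypothetical stand-in for 15 spectral bases over 129 spectral samples.
n_freq = 129
k = np.arange(n_freq)
window = np.hanning(n_freq)
bases = np.stack(
    [window * np.cos(2 * np.pi * m * k / n_freq) for m in range(4, 19)], axis=1
)
envelope = average_envelope(bases)
```

Peaks in `envelope` mark the spectral regions where the bases carry the most energy, regardless of their oscillation.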

Page 9: DataDrivenFeatures

Central 600 ms of temporal discriminants (impulse responses of FIR RASTA filters)

[Plot: amplitude of impulse response versus time [ms]]

Page 10: DataDrivenFeatures

2-D discriminants

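[Figure: 25 2-D discriminants shown as time-frequency panels; time axis from -200 to 200 ms, frequency axis from 0 to 4000 Hz. Panel labels, in order:]

4.2%  3.7%  3.3%  3.2%  3.0%
2.8%  2.8%  2.7%  2.1%  2.0%
1.7%  1.6%  1.5%  1.4%  1.4%
1.2%  1.1%  1.1%  1.1%  1.1%
1.1%  1.1%  1.1%  1.1%  1.1%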

Page 11: DataDrivenFeatures

Multi-RASTA

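[Figure: multi-RASTA filters. Temporal filters shown over -500 to 500 ms; across frequency, either an average or a frequency derivative over 3 critical bands; one example time-frequency filter shown (out of 32 possible).]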

                      matched training and test    mismatched channel
conventional (PLP)    5.2 %                        13.5 %
Multi-RASTA           3.7 %                        3.8 %
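A bank of zero-mean temporal filters with variable resolution can be sketched as below. The Gaussian-derivative shapes, the σ values and the function names are assumptions chosen for illustration; the slides state only that the basis is zero-mean with variable temporal resolution.

```python
import numpy as np

FRAME_STEP_MS = 10.0  # frame step from the experimental setup

def gaussian_derivative_filter(sigma_ms, half_length_ms=500.0, order=1):
    """Zero-mean FIR filter: first or second derivative of a Gaussian."""
    t = np.arange(-half_length_ms, half_length_ms + FRAME_STEP_MS, FRAME_STEP_MS)
    g = np.exp(-0.5 * (t / sigma_ms) ** 2)
    if order == 1:
        h = -t / sigma_ms ** 2 * g           # antisymmetric, so it sums to zero
    else:
        h = (t ** 2 / sigma_ms ** 4 - 1.0 / sigma_ms ** 2) * g
        h -= h.mean()                        # remove DC left by sampling/truncation
    return h / np.abs(h).max()

def multi_resolution_bank(sigmas_ms):
    """Stack first- and second-derivative filters for every sigma."""
    return np.stack([gaussian_derivative_filter(s, order=o)
                     for o in (1, 2) for s in sigmas_ms])

def apply_bank(band_trajectories, bank):
    """Filter each band trajectory (n_frames, n_bands) with every filter.

    Assumes n_frames is at least the filter length, so that
    np.convolve(..., mode="same") returns n_frames samples.
    """
    n_frames, n_bands = band_trajectories.shape
    out = np.empty((n_frames, n_bands, len(bank)))
    for i, h in enumerate(bank):
        for b in range(n_bands):
            out[:, b, i] = np.convolve(band_trajectories[:, b], h, mode="same")
    return out

bank = multi_resolution_bank([10.0, 20.0, 40.0, 80.0])   # hypothetical sigmas (ms)
rng = np.random.default_rng(0)
trajectories = rng.normal(size=(300, 15))                # stand-in band energies
features = apply_bank(trajectories, bank)
```

Because every filter sums to zero, a constant offset in a band trajectory is removed, which is one plausible reason such features are less sensitive to a fixed channel mismatch.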

Page 12: DataDrivenFeatures

Conclusions

• Data-driven (ANN-based TANDEM) feature extraction module as a means of implementing speech-specific (task-independent) knowledge
– aims to reduce the need for large task-specific acoustic training data

• LDA-guided pre-processing
Results qualitatively consistent across different databases (OGI Stories and forced-aligned SRI Broadcast News and Switchboard data)
– optimality of a Bark-like frequency scale
– need for a larger (about 500 ms) temporal context in feature extraction
– dominant time-frequency discriminants are outer products of spectral discriminants and temporal discriminants

• Linear pre-processing of data for TANDEM
– multi-RASTA (projections on a zero-mean, variable-temporal-resolution basis)
• demonstrated improvements on a small-vocabulary task

Page 13: DataDrivenFeatures

Initialize temporal basis R

Project Spectro-Temporal matrix on R

Estimate spectral basis L by LDA

Project Spectro-Temporal matrix on L

Estimate temporal basis R by LDA

2-D Linear Discriminant Analysis (Ye, Janardan and Li, NIPS 2005)
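The alternating procedure listed on this slide can be sketched as below. This is a minimal sketch, assuming each example is a frequency × time matrix so that the left basis L is spectral and the right basis R is temporal. The helper names, the ridge term, the fixed iteration count and the toy data are assumptions for illustration.

```python
import numpy as np

def lda_directions(Sw, Sb, k, ridge=1e-6):
    """Top-k eigenvectors of inv(Sw) @ Sb (ridge added for stability)."""
    d = Sw.shape[0]
    Sw_reg = Sw + ridge * np.trace(Sw) / d * np.eye(d)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw_reg, Sb))
    order = np.argsort(-vals.real)[:k]
    return vecs[:, order].real

def scatter(X, y, P):
    """Within/between-class scatter of matrices X[i] after projecting on P."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        D = (Xc - mc) @ P                        # (n_c, rows, k)
        Sw += np.einsum("nik,njk->ij", D, D)     # sum of D_n @ D_n.T
        B = (mc - mean) @ P
        Sb += len(Xc) * (B @ B.T)
    return Sw, Sb

def two_d_lda(X, y, n_spectral, n_temporal, n_iter=5):
    """X: (n_examples, n_freq, n_time). Returns spectral L and temporal R."""
    n_time = X.shape[2]
    R = np.eye(n_time)[:, :n_temporal]           # initialize temporal basis R
    Xt = X.transpose(0, 2, 1)
    for _ in range(n_iter):
        Sw, Sb = scatter(X, y, R)                # project on R
        L = lda_directions(Sw, Sb, n_spectral)   # estimate spectral basis L
        Sw, Sb = scatter(Xt, y, L)               # project on L
        R = lda_directions(Sw, Sb, n_temporal)   # estimate temporal basis R
    return L, R

# Toy data: 3 classes of 6 x 8 (frequency x time) matrices.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(90, 6, 8))
X[y == 1, 0, :] += 2.0
X[y == 2, :, 0] += 2.0
L, R = two_d_lda(X, y, n_spectral=2, n_temporal=3)
projected = L.T @ X[0] @ R                       # reduced 2 x 3 representation
```

Each 2-D discriminant implied by this factorization is an outer product `np.outer(L[:, i], R[:, j])` of one spectral and one temporal discriminant.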

Page 14: DataDrivenFeatures

Eigenvalues

[Panels: spectral discriminants; temporal discriminants; 2-D discriminants]

Page 15: DataDrivenFeatures

Temporal discriminants across critical bands

[Panels: first discriminant; second discriminant; third discriminant]

Page 16: DataDrivenFeatures

• preprocessing of input data for TANDEM is beneficial – TRAP technique (nonlinear and data-driven)

– multi-RASTA filters (linear and “knowledge” guided)

[Diagram: data (time-frequency plane) → linear processing → evidence → TANDEM (trained NN) → posteriogram (e.g. /f/ /ay/ /v/) → some function of phoneme posteriors → features for HMM]

evidence: anything that carries the relevant information

Page 17: DataDrivenFeatures

Temporal discriminants (frequency responses of FIR RASTA filters)

[Plot: log magnitude spectrum [dB] versus modulation frequency [Hz]]
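A frequency response of this kind can be computed from an impulse response as sketched below. With a 10 ms frame step the filters run at 100 frames per second, so the modulation-frequency axis extends to 50 Hz. The difference-of-Gaussians impulse response is a hypothetical stand-in for a temporal discriminant, and the function name and dB floor are illustrative.

```python
import numpy as np

FRAME_RATE_HZ = 100.0  # 10 ms frame step

def modulation_response_db(h, n_fft=1024, floor_db=-120.0):
    """Normalized log-magnitude response of an FIR filter applied to frames."""
    spectrum = np.fft.rfft(h, n=n_fft)           # zero-padded for a smooth curve
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / FRAME_RATE_HZ)
    mag = np.abs(spectrum)
    floor = 10.0 ** (floor_db / 20.0)            # avoids log of zero at DC
    mag_db = 20.0 * np.log10(np.maximum(mag / mag.max(), floor))
    return freqs, mag_db

# Hypothetical band-pass impulse response: difference of two Gaussians.
t = np.arange(-50, 51) / FRAME_RATE_HZ           # seconds, 101 taps
h = np.exp(-0.5 * (t / 0.02) ** 2) / 0.02 - np.exp(-0.5 * (t / 0.06) ** 2) / 0.06
h -= h.mean()                                    # zero mean: no response at 0 Hz

freqs, mag_db = modulation_response_db(h)
peak_hz = freqs[np.argmax(mag_db)]
```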

Page 18: DataDrivenFeatures

Experimental Setup

• 30 hours of phoneme-labeled (forced alignment) data from SRI (Switchboard, broadcast news,…)

• spectral vectors: 129 samples of LPC log power spectrum (12th order, 30 ms window, 10 ms step)

• temporal vectors: 2010 ms long (201 samples), labeled by the phoneme in the center, 10 ms step

• spectro-temporal matrix: 201 time samples x 129 spectral samples, 10 ms step
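The windowing described in this setup can be sketched as below: 100 frames of context on each side of the center give 201 frames, or 2010 ms at a 10 ms step. The function name, the handling of utterance edges and the random stand-in data are assumptions; the slides do not say how edges were treated.

```python
import numpy as np

FRAME_STEP_MS = 10
CONTEXT_FRAMES = 100   # frames on each side of the center: 2 * 100 + 1 = 201

def spectro_temporal_matrices(spectrogram, frame_labels, context=CONTEXT_FRAMES):
    """Cut (2*context+1)-frame windows, each labeled by its center frame.

    spectrogram: (n_frames, n_spectral) log power spectra.
    Returns windows (n_windows, 2*context+1, n_spectral) and center labels.
    Frames closer than `context` to either end are skipped (an assumption).
    """
    n_frames = spectrogram.shape[0]
    centers = np.arange(context, n_frames - context)
    if len(centers) == 0:
        raise ValueError("utterance shorter than one full window")
    windows = np.stack([spectrogram[c - context:c + context + 1] for c in centers])
    return windows, frame_labels[centers]

# Stand-in data: 250 frames of 129-point log spectra with 40 phoneme classes.
rng = np.random.default_rng(0)
spectrogram = rng.normal(size=(250, 129))
labels = rng.integers(0, 40, size=250)
windows, center_labels = spectro_temporal_matrices(spectrogram, labels)

window_ms = windows.shape[1] * FRAME_STEP_MS       # duration of one window
temporal_vectors_band0 = windows[:, :, 0]          # 201-sample vectors, band 0
```

A temporal vector is a single column of one window (one spectral sample followed over 201 frames), while the full window is the 201 × 129 spectro-temporal matrix.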

Page 19: DataDrivenFeatures

Motivation

ASR requires knowledge and knowledge comes from data

– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, …)

– task-specific knowledge (e.g. language and its phonotactics, environment,…)

CONVENTIONAL WAY

features → classifier trained on English, adapted on small amounts of task-specific data

ALTERNATIVE

data-driven features derived from English → classifier trained on small amounts of task-specific data

Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006 (and also this meeting)

PROBLEM

For some tasks, amounts of data may be limited

ONE SOLUTION

Acquire speech-specific knowledge from large amounts of American English data

Page 20: DataDrivenFeatures

Frequencies around 600 Hz are the most important for decoding of nonsense syllables